
[Big Data Analysis Engineer] Practical Exam, Round 4 - Type 2: Missing-Value Imputation, drop

데이터팍스 2024. 8. 20. 17:54

 

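import pandas as pd

# `train` and `test` are assumed to be already loaded as DataFrames
x_train=train.drop(columns=['Segmentation'])
x_test=test.copy()  # copy so the later fillna calls do not modify `test` itself
y_train=train['Segmentation']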

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
 

 

print(x_train.info())
print(x_test.info())
print(y_train.info())
 

There are a lot of missing values. Finally, a dataset that actually has missing values!

print(x_train.describe())
print(y_train.describe())
print(x_test.describe())
 

I checked for outliers, but there were none.

 

print(x_train.isnull().sum())
print(x_test.isnull().sum())
 
 

But there are a lot of missing values...

# x_train : missing values in Ever_Married, Graduated, Profession, Work_Experience (numeric), Family_Size (numeric), Var_1
# x_test : Ever_Married, Graduated, Profession, Work_Experience, Family_Size, Var_1
x_train=x_train.drop(columns=['Work_Experience'])
x_test=x_test.drop(columns=['Work_Experience'])

family_median=x_train['Family_Size'].median()
ever_mode=x_train['Ever_Married'].mode()
graduated_mode=x_train['Graduated'].mode()
var_mode=x_train['Var_1'].mode()
profession_mode=x_train['Profession'].mode()

x_train['Family_Size']=x_train['Family_Size'].fillna(family_median)
x_train['Ever_Married']=x_train['Ever_Married'].fillna(ever_mode[0])
x_train['Graduated']=x_train['Graduated'].fillna(graduated_mode[0])
x_train['Var_1']=x_train['Var_1'].fillna(var_mode[0])
x_train['Profession']=x_train['Profession'].fillna(profession_mode[0])

x_test['Family_Size']=x_test['Family_Size'].fillna(family_median)
x_test['Ever_Married']=x_test['Ever_Married'].fillna(ever_mode[0])
x_test['Graduated']=x_test['Graduated'].fillna(graduated_mode[0])
x_test['Var_1']=x_test['Var_1'].fillna(var_mode[0])
x_test['Profession']=x_test['Profession'].fillna(profession_mode[0])
 

** Note: the test data must also be filled with the median computed from the train data **

# Continuous variables: median or mean
# df['column_name'].median()
# df['column_name'].mean()
# Categorical variables: mode
# df['column_name'] = df['column_name'].fillna(replacement_value)
 
## Median imputation example ##
med_age = x_train['age'].median()
x_train['age'] = x_train['age'].fillna(med_age)
x_test['age'] = x_test['age'].fillna(med_age)
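The "compute on train, apply to both" rule can be wrapped in a small helper. This is only a sketch: `impute_from_train` and the toy frames are hypothetical names and data, not part of the exam code. It assumes numeric columns get the train median and every other column gets the first train mode.

```python
import pandas as pd

def impute_from_train(train_df, test_df):
    """Fill missing values in both frames using statistics from train only.

    Hypothetical helper: numeric columns use the train median,
    all other columns use the first train mode.
    """
    train_df, test_df = train_df.copy(), test_df.copy()
    for col in train_df.columns:
        if col not in test_df.columns:
            continue
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill_value = train_df[col].median()
        else:
            # mode() returns a Series; [0] picks the most frequent value.
            # This would fail on a column that is entirely missing.
            fill_value = train_df[col].mode()[0]
        train_df[col] = train_df[col].fillna(fill_value)
        test_df[col] = test_df[col].fillna(fill_value)
    return train_df, test_df

# Toy data standing in for the real x_train / x_test
toy_train = pd.DataFrame({'Family_Size': [1.0, 2.0, None, 5.0],
                          'Graduated': ['Yes', 'No', 'Yes', None]})
toy_test = pd.DataFrame({'Family_Size': [None, 3.0],
                         'Graduated': [None, 'No']})

filled_train, filled_test = impute_from_train(toy_train, toy_test)
print(filled_test)
```

The test frame never contributes to the fill values, so nothing about the test distribution leaks into preprocessing.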
 

** Note: take element [0] of the mode computed from the train data **

ever_mode=x_train['Ever_Married'].mode()
x_test['Ever_Married']=x_test['Ever_Married'].fillna(ever_mode[0])  # mode() returns a Series, so take [0]
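Why the `[0]` matters: `mode()` returns a Series rather than a single value, and `fillna` with a Series aligns on the index instead of broadcasting. A minimal sketch on made-up data (the Series `s` is hypothetical):

```python
import pandas as pd

s = pd.Series(['Yes', None, 'No', 'Yes', None])  # toy data with two missing rows

mode_series = s.mode()              # a Series whose index 0 holds the most frequent value
wrong = s.fillna(mode_series)       # aligned by index: rows 1 and 4 find no match and stay missing
right = s.fillna(mode_series[0])    # scalar fill: every missing row is replaced

print(wrong.isnull().sum())
print(right.isnull().sum())
```

Passing the whole Series does not raise an error, which is what makes the mistake easy to miss.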
 

Anyway, I filled in all the missing values and dropped the column I decided to discard. Work_Experience could have been imputed too, but its number of missing values was large, so I just dropped it.
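One way to make the "drop or impute" call less of a gut feeling is to look at the share of missing rows per column. The sketch below uses toy data and a hypothetical cutoff (`threshold = 0.5`); neither comes from the exam dataset, and the right cutoff is a judgment call.

```python
import pandas as pd

# Toy frame standing in for x_train
toy = pd.DataFrame({
    'Work_Experience': [1.0, None, None, 0.0, None],
    'Family_Size': [2.0, 3.0, None, 1.0, 4.0],
})

# isnull() gives booleans, so the column mean is the fraction of missing rows
missing_ratio = toy.isnull().mean()
print(missing_ratio)

threshold = 0.5  # hypothetical cutoff, chosen only for this illustration
cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
print(cols_to_drop)
```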

There are no outliers, so I skipped that step. Next come dropping variables and one-hot encoding. (I actually forgot the encoding, hit an error, and had to scroll back up to do it, haha..)

ID=x_test['ID'].copy()
x_train=x_train.drop(columns=['ID'])
x_test=x_test.drop(columns=['ID'])
 

The ID column isn't needed as a feature, so let's drop it (the test IDs are saved first for the submission file).

 

df = df.drop(columns=['col1', 'col2'])
df = df.drop(['col1', 'col2'], axis=1)
 

<Either form above removes unneeded columns>

 

x_train=pd.get_dummies(x_train)
x_test=pd.get_dummies(x_test)

print(x_train.info())
print(x_test.info())
 

After one-hot encoding, I compared the number and order of columns in x_train and x_test. They matched, so no reindex was needed.
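For the case where the columns do not match, here is a sketch of the reindex step on toy data. The frames are hypothetical; the assumption is that a category appears in train but not in test, so test is missing one dummy column.

```python
import pandas as pd

# Toy data: 'Lawyer' appears only in train
toy_train = pd.DataFrame({'Profession': ['Artist', 'Doctor', 'Lawyer']})
toy_test = pd.DataFrame({'Profession': ['Artist', 'Doctor']})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)   # has no Profession_Lawyer column

# Align test to train's columns and order; columns missing from test are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns) == list(train_enc.columns))
```

Reindexing against the train columns also drops any dummy column that exists only in test, which keeps the model's input shape fixed.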

 

from sklearn.model_selection import train_test_split
x_train,x_val,y_train,y_val=train_test_split(x_train,
                                             y_train,
                                             stratify=y_train,
                                             test_size=0.2,
                                             random_state=2024)
print(x_train.shape)
print(x_val.shape)
print(y_train.shape)
print(y_val.shape)
 

This is a classification task, so be sure to use stratify!
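What stratify buys you can be seen on a small imbalanced example. The labels and feature below are made up; the sketch only shows that the validation split keeps the same class proportions as the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with an imbalanced 40/30/20/10 class mix
y = pd.Series(['A'] * 40 + ['B'] * 30 + ['C'] * 20 + ['D'] * 10)
X = pd.DataFrame({'feature': range(len(y))})

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=2024
)

# Class proportions in the full data and in the validation split
print(y.value_counts(normalize=True).sort_index())
print(y_val.value_counts(normalize=True).sort_index())
```

Without `stratify`, a rare class such as D could end up with very few or no rows in the validation set, which makes the macro F1 score unstable.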

 

from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(random_state=2024)
model.fit(x_train,y_train)
 
y_pred=model.predict(x_val)
from sklearn.metrics import f1_score
f1=f1_score(y_val,y_pred, average='macro')
print(f1)
 

** Note: y is multiclass, so don't forget average='macro' **
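To make "macro" concrete: it is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of size. The pure-Python sketch below spells that out; `macro_f1` and the toy labels are hypothetical and only meant to mirror the definition, not to replace `f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels: class D is never predicted, so its F1 is 0 and pulls the average down
y_true = ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'D']
y_pred = ['A', 'A', 'A', 'A', 'B', 'A', 'C', 'C']

print(macro_f1(y_true, y_pred))
```

Here 6 of 8 predictions are correct, yet the macro score is noticeably lower because the missed minority class weighs as much as the majority class.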

 

y_result=model.predict(x_test)
result=pd.DataFrame({'ID':ID,'Segmentation':y_result})
result.to_csv('datafox',index=False)
pd.read_csv("datafox")
 

The end!