๐Ÿ† ์ž๊ฒฉ์ฆ, ์–ดํ•™

[Big Data Analysis Engineer] Practical Exam, Round 7 - Type 2: RandomForestRegressor

๋ฐ์ดํ„ฐํŒ์Šค 2024. 8. 20. 17:52

 

# Separate the features from the target column '이용금액' (usage amount)
x_train=train.drop(['이용금액'],axis=1)
y_train=train['이용금액']
x_test=test

print(x_train.info())
print(x_test.info())
print(y_train.info())
 
 
1. First, split the train data into x_train and y_train.
2. Check the data types with info() >> any object or category columns need one-hot encoding.
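
As a small illustration of step 2, the columns that need encoding can also be listed programmatically instead of read off the info() output. This is only a sketch on toy data: the column names `amount_prev`, `store_type`, and `grade` are made up, and "non-numeric" is used here as a stand-in for "object or category".

```python
import pandas as pd

# Toy stand-in for x_train; these column names are hypothetical.
x_train = pd.DataFrame({
    "amount_prev": [1200, 800, 1500],
    "store_type": ["cafe", "mart", "cafe"],
    "grade": pd.Categorical(["A", "B", "A"]),
})

# Every non-numeric column (text or category) is a candidate for one-hot encoding.
cat_cols = [
    col for col in x_train.columns
    if not pd.api.types.is_numeric_dtype(x_train[col])
]
print(cat_cols)
```

If the list comes back empty, the one-hot encoding step can be skipped.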
print(x_train.head())
print(x_test.head())
print(y_train.head())
 

3. Take a quick look at what the data looks like with head().

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
 

4. Also check that x_train and y_train have the same number of rows.

print(x_train.describe())
print(x_test.describe())
print(y_train.describe())
 

5. Check the summary statistics with describe(), comparing the min and max values of x_train and x_test to see whether there are outliers.
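
The min/max comparison in step 5 can be put side by side in one table rather than compared by eye. A minimal sketch on invented data (the column names and values are hypothetical, not from the exam dataset):

```python
import pandas as pd

# Toy frames; in the post these would be the real x_train and x_test.
x_train = pd.DataFrame({"amount_prev": [100, 250, 300], "visits": [1, 4, 2]})
x_test = pd.DataFrame({"amount_prev": [120, 9000, 280], "visits": [2, 3, 1]})

# One row per column, with the train and test extremes next to each other.
ranges = pd.DataFrame({
    "train_min": x_train.min(),
    "train_max": x_train.max(),
    "test_min": x_test.min(),
    "test_max": x_test.max(),
})

# Flag columns whose test values fall outside the range seen in train.
ranges["outside_train_range"] = (
    (ranges["test_min"] < ranges["train_min"])
    | (ranges["test_max"] > ranges["train_max"])
)
print(ranges)
```

A flagged column is not necessarily an outlier problem, but it is a good place to look first.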

print(y_train.value_counts())
 

6. Inspect the y_train values to confirm the target is continuous > this has to be solved as a regression problem.
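
A rough way to back up the "continuous, so regression" call in step 6 is to look at how many distinct values the target has. The sketch below uses invented values, and the 0.5 cutoff is an arbitrary choice for illustration, not a rule from the exam.

```python
import pandas as pd

# Toy target; the values are invented for illustration.
y_train = pd.Series([1200.0, 800.5, 1500.0, 430.25, 990.0, 1200.0])

n_unique = y_train.nunique()
unique_ratio = n_unique / len(y_train)
is_numeric = pd.api.types.is_numeric_dtype(y_train)

# Heuristic only: a numeric target where most values are distinct points to regression.
looks_continuous = bool(is_numeric and unique_ratio > 0.5)
print(n_unique, round(unique_ratio, 2), looks_continuous)
```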

print(x_train.isnull().sum())
print(x_test.isnull().sum())
print(y_train.isnull().sum())
 

7. Check for missing values >> if there are any, impute them.
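
The post does not show the imputation itself, so here is one possible sketch: fill numeric gaps with the train median and text gaps with the train mode, and reuse those same train-derived values for the test set. The column names and the choice of median and mode are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frames with gaps; column names are hypothetical.
x_train = pd.DataFrame({
    "amount_prev": [100.0, np.nan, 300.0],
    "store_type": ["cafe", None, "cafe"],
})
x_test = pd.DataFrame({
    "amount_prev": [np.nan, 200.0],
    "store_type": ["mart", None],
})

# Compute the fill values from the training data only.
num_fill = x_train["amount_prev"].median()
cat_fill = x_train["store_type"].mode()[0]

# Apply the same values to both sets so test never influences the statistics.
fill_values = {"amount_prev": num_fill, "store_type": cat_fill}
x_train = x_train.fillna(fill_values)
x_test = x_test.fillna(fill_values)

print(x_train.isnull().sum().sum(), x_test.isnull().sum().sum())
```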

 

Id = x_test['ID'].copy()  # keep the test IDs for the submission file
x_train = x_train.drop(columns=['ID'])  # drop(columns=['var1', 'var2']) removes several columns at once
x_test = x_test.drop(columns=['ID'])
 

8. Variable handling

Remove unnecessary variables (columns).

ID is not a useful feature, so drop it.

However, if the test set's ID is needed later for the submission, save it separately first.

x_train=pd.get_dummies(x_train)
x_test=pd.get_dummies(x_test)
print(x_train.info())
print(x_test.info())
 

9. Run one-hot encoding and check that it worked (row count, column order, column count).

If x_test ends up with more columns than x_train, align x_train to x_test's columns as follows:

x_train = x_train.reindex(columns = x_test.columns, fill_value=0) 
x_train.info()
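
To see what the reindex call does, here is the same alignment on toy data where the test set contains a category that train never saw. The column names and categories are made up. Note that a train-only category column is dropped by this alignment.

```python
import pandas as pd

# Toy frames; "bakery" appears only in test, "mart" only in train.
x_train = pd.DataFrame({"amount_prev": [100, 250, 300],
                        "store_type": ["cafe", "mart", "cafe"]})
x_test = pd.DataFrame({"amount_prev": [120, 280],
                       "store_type": ["cafe", "bakery"]})

x_train_enc = pd.get_dummies(x_train)
x_test_enc = pd.get_dummies(x_test)
print(list(x_train_enc.columns))  # the two frames disagree at this point
print(list(x_test_enc.columns))

# Same call as in the post: give train exactly the test columns, filling new ones with 0.
x_train_aligned = x_train_enc.reindex(columns=x_test_enc.columns, fill_value=0)
print(list(x_train_aligned.columns))
```

After alignment both frames have the same columns in the same order, which is what the model needs at predict time.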
 
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train,
                                                  y_train,
                                                  test_size=0.2,
                                                  random_state=2024)
 

10. Split the data into training and validation sets.

from sklearn.ensemble import RandomForestRegressor
model=RandomForestRegressor(random_state=2024)
model.fit(x_train,y_train)
 

11. Modeling

y_pred=model.predict(x_val)
from sklearn.metrics import mean_squared_error
mse=mean_squared_error(y_val,y_pred)
rmse=mse**0.5
print(rmse)
 

12. Predicted on the validation data with the model, then computed the RMSE.
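
RMSE is expressed in the same units as the target, so it is easier to judge when read against the target's scale. The sketch below computes it by hand on invented numbers and divides by the target mean; that ratio is just a convenience for reading the result, not an exam metric.

```python
import numpy as np

# Invented validation targets and predictions.
y_val = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Mean of squared errors, then the square root: same result as mse**0.5 above.
mse = np.mean((y_val - y_pred) ** 2)
rmse = mse ** 0.5

# Reading the error relative to the typical target size.
relative_rmse = rmse / y_val.mean()
print(round(rmse, 2), round(relative_rmse, 3))
```

A target measured in large amounts will naturally produce a large RMSE even when the model is reasonable.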

 

Whoa?? The value looked way too large, so I also ran the answer code, and it comes out the same.

#Don't panic even if the RMSE is large..

There seems to be room to try further approaches, such as post-processing the date data or the business category name~!

...or so it says, so it should be fine, right??
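
The date post-processing mentioned above is not shown in the post. One common form it could take is splitting a date column into numeric parts before modeling. This is a sketch under assumptions: the column name `date` and the chosen features are hypothetical.

```python
import pandas as pd

# Toy frame; the real dataset's date column name may differ.
df = pd.DataFrame({"date": ["2024-01-05", "2024-02-17", "2024-03-31"]})

dates = pd.to_datetime(df["date"])
df["month"] = dates.dt.month
df["dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)

# Drop the raw string so get_dummies does not create one column per date.
df = df.drop(columns=["date"])
print(df)
```

The same transformation would have to be applied to both x_train and x_test before encoding.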

 

y_result = model.predict(x_test)
result = pd.DataFrame({'ID': Id, 'target': y_result})
print(result.head())
result.to_csv('datafox.csv', index=False)
 

13. Data submission

df2 = pd.read_csv("datafox.csv")
print(df2.head())
 

14. Load the file back to check that it was saved correctly.
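
Beyond looking at head(), a few quick checks can confirm the saved file has the expected shape and columns. The sketch below round-trips a toy result through an in-memory buffer instead of a real file; the IDs and values are invented.

```python
import io
import pandas as pd

# Toy submission frame standing in for `result`.
result = pd.DataFrame({"ID": [101, 102, 103], "target": [1500.5, 320.0, 980.25]})

# Write without the index, then read it back, as with datafox.csv.
buffer = io.StringIO()
result.to_csv(buffer, index=False)
buffer.seek(0)
df2 = pd.read_csv(buffer)

# Checks: same shape, same columns, no missing predictions.
same_shape = df2.shape == result.shape
same_columns = list(df2.columns) == list(result.columns)
no_missing = int(df2["target"].isnull().sum()) == 0
print(same_shape, same_columns, no_missing)
```

If `index=False` were omitted, the reloaded file would carry an extra unnamed column and the shape check would fail.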