
[Big Data Analysis Engineer (빅데이터분석기사)] Practical Exam Round 5, Type 2: using reindex when x_train and x_test have different column counts

데이터팍스 2024. 8. 20. 17:54

 

 

import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e5_p2_train_.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e5_p2_test_.csv')

display(train.head(2))
test.head(2)

x_train=train.drop(columns=['price'])
y_train=train['price']
x_test=test

# info() prints its own report and returns None, so wrapping it in print() only adds a stray "None"
x_train.info()
y_train.info()
x_test.info()
 

After loading the data I split it into features and target, then checked the data types with info(). There are non-numeric columns, so one-hot encoding is needed.
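As a rough sketch of how the "needs one-hot encoding" call can be made without reading the info() output by eye, the snippet below lists the non-numeric columns of a frame. The toy frame and its column names (`year`, `model`, `fuelType`) are made up for illustration and are not taken from the exam data.

```python
import pandas as pd

# Toy frame standing in for x_train; the column names are hypothetical.
toy = pd.DataFrame({
    "year": [2017, 2019, 2020],
    "model": ["A1", "Q3", "A1"],
    "fuelType": ["Petrol", "Diesel", "Petrol"],
})

# Non-numeric columns are the ones pd.get_dummies would expand by default.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)  # ['model', 'fuelType']

# Distinct categories per column: each one becomes a dummy column.
print(toy[categorical_cols].nunique())
```

Running the same two lines on the real x_train and x_test side by side gives an early hint when the category counts differ between the two files.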

# Summary statistics (min and max are used to look for outliers)
print(x_test.describe())
print(x_train.describe())
print(y_train.describe())
 

To check for outliers I used the descriptive statistics function and compared min and max; there were no obvious outliers.

print(x_train.isnull().sum())
print(x_test.isnull().sum())
print(y_train.isnull().sum())
 

There were no missing values either.

ID=x_test['ID'].copy()
x_train=x_train.drop(columns='ID')
x_test=x_test.drop(columns='ID')
 

ID is not a useful feature, so I dropped it (keeping a copy of the test IDs for the submission file).

x_train=pd.get_dummies(x_train)
x_test=pd.get_dummies(x_test)
 

๊ทผ๋ฐ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ฒผ๋‹ค

 

 

After one-hot encoding, x_train and x_test had different numbers of columns, and in a different order. The train data contained category values that do not appear in test.
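One way to see exactly which columns differ is to compare the two column indexes after encoding. This is only a sketch on toy data: the frames `toy_train` and `toy_test` and the `model` values are hypothetical, chosen so that train has one category that test lacks.

```python
import pandas as pd

# Hypothetical toy data: "TT" appears only in train.
toy_train = pd.DataFrame({"year": [2017, 2019, 2020], "model": ["A1", "Q3", "TT"]})
toy_test = pd.DataFrame({"year": [2018, 2021], "model": ["Q3", "A1"]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# Column counts differ after encoding.
print(train_enc.shape, test_enc.shape)  # (3, 4) (2, 3)

# Index.difference lists the columns present on one side only.
only_in_train = train_enc.columns.difference(test_enc.columns)
only_in_test = test_enc.columns.difference(train_enc.columns)
print(list(only_in_train))  # ['model_TT']
print(list(only_in_test))   # []
```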

I puzzled over how to handle this for a while, and then:

x_test = x_test.reindex(columns=x_train.columns, fill_value=0)
 

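I reindexed the frame with fewer columns (x_test) against the one with more columns (x_train). Columns that x_test was missing are created and filled with 0.

After that I confirmed that both the number and the order of the columns were identical.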
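To make the effect of reindex concrete, here is a small sketch on hand-built frames. The column names (`model_A1`, `model_X9`, and so on) are hypothetical. It also shows a case the exam data did not have: a column that exists only in test is silently dropped, because only the train columns are kept.

```python
import pandas as pd

# Hypothetical already-encoded frames.
train_enc = pd.DataFrame({
    "year": [2017, 2019],
    "model_A1": [1, 0],
    "model_Q3": [0, 1],
    "model_TT": [0, 0],
})
# Test has a different column order, lacks model_A1 and model_TT,
# and has one extra column (model_X9) that train never saw.
test_enc = pd.DataFrame({
    "model_Q3": [1, 0],
    "year": [2018, 2021],
    "model_X9": [0, 1],
})

aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(aligned.columns))         # ['year', 'model_A1', 'model_Q3', 'model_TT']
print(aligned["model_TT"].tolist())  # [0, 0]  (missing column created, filled with 0)
print("model_X9" in aligned.columns) # False   (test-only column dropped)

# Same check as in the post: count and order now match.
assert list(aligned.columns) == list(train_enc.columns)
```

Filling with 0 fits one-hot columns, since 0 means "this category is not present in the row". For other kinds of missing columns a different fill value may be more appropriate.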

from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train,
                                                  y_train,
                                                  test_size=0.2,
                                                  random_state=2024)
print(x_train.shape)
print(x_val.shape)
print(y_train.shape)
print(y_val.shape)
 

After that I split x_train and y_train into training and validation sets (this is a regression task, so stratify is not needed).

from sklearn.ensemble import RandomForestRegressor
model=RandomForestRegressor(random_state=2024)
model.fit(x_train,y_train)
y_pred=model.predict(x_val)
from sklearn.metrics import mean_squared_error
mse=mean_squared_error(y_val,y_pred)
rmse=mse**0.5
print(rmse)
 
y_result=model.predict(x_test)
result=pd.DataFrame({'ID':ID,'Target':y_result})
result.to_csv('datafox',index=False)
pd.read_csv("datafox")
 

The result came out matching the expected answer!

 

 

Key takeaway from Round 5

x_test = x_test.reindex(columns=x_train.columns, fill_value=0)