# libraries from last week's article
pip install pandas matplotlib squarify seaborn
# new libraries for this one
pip install scipy scikit-learn catboost statsmodels
from datetime import datetime

cars['age'] = datetime.now().year - cars['year']
cars = cars.drop(columns='year')
cars = cars.drop(columns='make')
cars = cars.drop(columns='model')
dropna will remove all rows with empty or null values.

from scipy import stats
cars = cars.dropna()
# keep rows that are less than three standard deviations above the mean
cars = cars[stats.zscore(cars.price) < 3]
cars = cars[stats.zscore(cars.hp) < 3]
cars = cars[stats.zscore(cars.mileage) < 3]
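For reference, stats.zscore just standardizes a column. Here is the same filter written out with plain pandas (using the population standard deviation, ddof=0, to match scipy); this is only illustrative and not part of the pipeline:

# z = (value - mean) / standard deviation, equivalent to the first filter above
z = (cars.price - cars.price.mean()) / cars.price.std(ddof=0)
same_rows = cars[z < 3]  # rows less than 3 standard deviations above the mean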
get_dummies converts a categorical column into a set of indicator (one-hot) columns.

offerTypeDummies = pd.get_dummies(cars.offerType)
cars = cars.join(offerTypeDummies)
cars = cars.drop(columns='offerType')
gearDummies = pd.get_dummies(cars.gear)
cars = cars.join(gearDummies)
cars = cars.drop(columns='gear')
# rename 'Others' so it won't collide with the 'Others' make column later
cars['fuel'] = cars['fuel'].replace('Others', 'OthersFuel')
fuelDummies = pd.get_dummies(cars.fuel)
cars = cars.join(fuelDummies)
cars = cars.drop(columns='fuel')
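In case get_dummies is new to you, a quick toy example with made-up values shows what it produces (recent pandas versions print True/False instead of 0/1):

print(pd.get_dummies(pd.Series(['Manual', 'Automatic', 'Manual'])))
#    Automatic  Manual
# 0          0       1
# 1          1       0
# 2          0       1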
A seaborn heatmap will show us graphically which variables are positively or negatively correlated.

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(cars.corr(), annot=True, cmap='coolwarm')
plt.show()
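If the heatmap gets crowded, the same information can be read numerically. This optional line sorts every column's correlation with price:

print(cars.corr()['price'].sort_values(ascending=False))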
seaborn jointplot
.sns.set_theme(style="darkgrid")
sns.jointplot(x="hp", y="price", data=cars,
kind="reg", color="m", line_kws={'color': 'red'})
plt.show()
from sklearn.model_selection import train_test_split
X = cars.drop(columns='price')
Y = cars.price
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, train_size=0.7, test_size=0.3, random_state=100)
For the LinearRegression model, we will pass the train data to the fit method and then the test data to predict.

from sklearn import linear_model
from sklearn.metrics import r2_score
lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
print(r2_score(y_true=y_test, y_pred=y_pred)) # 0.81237
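Optionally, we can peek at what the linear model learned; coef_ and intercept_ are standard sklearn attributes:

# one coefficient per feature column, plus the intercept
print(pd.Series(lm.coef_, index=X_train.columns).sort_values())
print(lm.intercept_)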
Next comes the Regressor from CatBoost. The model is created with some hyperparameters that we adjusted by testing. Similar to the previous one, fit the model with the train data and check the score, resulting in 0.92416.

from catboost import CatBoostRegressor
model = CatBoostRegressor(iterations=6542, learning_rate=0.03)
model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
)
print(model.score(X, Y)) # 0.92416
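The 6542 iterations came from manual testing. As an alternative sketch, CatBoost can choose the iteration count itself with early stopping; the numbers below are illustrative, not tuned:

model = CatBoostRegressor(iterations=10000, learning_rate=0.03)
model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
    early_stopping_rounds=100,  # stop when the eval metric stops improving
    verbose=False,
)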
For statsmodels, we will change X's value and take only mileage, hp, and age. The result is almost 10 percentage points better than the LinearRegression score above.

import statsmodels.api as sm
X = cars[['mileage', 'hp', 'age']]
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print(model.rsquared) # 0.91823
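Keep in mind that sm.OLS does not add an intercept by default, which tends to inflate R-squared. A sketch of the intercept version for comparison:

# add_constant appends a column of ones so the model fits an intercept
modelC = sm.OLS(Y, sm.add_constant(X)).fit()
print(modelC.rsquared)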
Finally, CatBoost again, this time keeping make and model and converting them to dummies. It will take much longer and use more space, since we end up with more than 700 feature columns, but it is worth it if you are after accuracy.

makeDummies = pd.get_dummies(cars.make)
cars = cars.join(makeDummies)
cars = cars.drop(columns='make')
modelDummies = pd.get_dummies(cars.model)
cars = cars.join(modelDummies)
cars = cars.drop(columns='model')
# the rest of the features, just as before
# split train and test data
model = CatBoostRegressor(iterations=6542, learning_rate=0.03)
model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
)
print(model.score(X, Y)) # 0.9664
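Since this larger model takes a while to train, it pays off to persist it. A minimal sketch with CatBoost's save_model and load_model; the file name is arbitrary:

model.save_model('car_price.cbm')
# later, even in another process
model = CatBoostRegressor()
model.load_model('car_price.cbm')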
sorted_feature_importance = model.get_feature_importance().argsort()[-20:]
plt.barh(
    X.columns[sorted_feature_importance],
    model.feature_importances_[sorted_feature_importance]
)
plt.xlabel("Feature Importance")
plt.show()
To predict the price of new cars, we will use CatBoost again and call its predict method. We would need to go all the way again, transforming all the data with dummies, so we'll summarize. In a real-world app, this process would be extracted into a function and performed equally for training, test, or actual data.

realData = pd.DataFrame.from_records([
    {'mileage': 87000, 'make': 'Volkswagen', 'model': 'Golf',
     'fuel': 'Gasoline', 'gear': 'Manual', 'offerType': 'Used',
     'price': 12990, 'hp': 125, 'year': 2015},
    {'mileage': 230000, 'make': 'Opel', 'model': 'Zafira Tourer',
     'fuel': 'CNG', 'gear': 'Manual', 'offerType': 'Used',
     'price': 5200, 'hp': 150, 'year': 2012},
    {'mileage': 5, 'make': 'Mazda', 'model': '3', 'hp': 122,
     'gear': 'Manual', 'offerType': 'Employee\'s car',
     'fuel': 'Gasoline', 'price': 20900, 'year': 2020}
])
realData = realData.drop(columns='price')
realData['age'] = datetime.now().year - realData['year']
realData = realData.drop(columns='year')
# all the other transformations and dummies go here
# use the training columns (without price) as the template
fitModel = pd.DataFrame(columns=X.columns)
fitModel = pd.concat([fitModel, realData], ignore_index=True)
# dummy columns missing from the new records are filled with zeros
fitModel = fitModel.fillna(0)
preds = model.predict(fitModel)
print(preds) # [12213.35324984 5213.058479 20674.08838559]
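As mentioned above, in a real-world app all those transformations would live in a single function applied to any input. A rough sketch, where prepare_features is a hypothetical helper and the steps inside only summarize the pipeline above:

def prepare_features(df, columns):
    # hypothetical helper: repeat the cleaning and dummy steps from above
    df = df.copy()
    df['age'] = datetime.now().year - df['year']
    df = df.drop(columns='year')
    for col in ['offerType', 'gear', 'fuel', 'make', 'model']:
        df = df.join(pd.get_dummies(df.pop(col)))
    # align with the training columns; dummies a record lacks become 0
    return df.reindex(columns=columns, fill_value=0)

# usage, starting from the raw records before the manual steps above:
# preds = model.predict(prepare_features(rawData.drop(columns='price'), X.columns))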
Comparing the three approaches, sklearn performs a bit worse, and the other two are pretty similar in results (if excluding makes and models) but not in time spent training, so that might be a crucial aspect to consider when choosing between them. We kept the CatBoost model trained with all the available data. Training might take up to a minute, but the model can be stored in a file and instantly loaded when needed.

Loading data with pandas is quite simple. Then, there are some steps to perform: describe the dataset, explore it manually, look at the values, and check for empty or null values.