In just a few hours, you'll be able to understand and build basic regression and classification models with optimal hyperparameters.
When evaluating a machine learning model, training and testing on the same dataset is not a great idea. Why? Let us draw a relatable analogy: evaluating a model on its own training data is like testing a student on the exact questions they practiced; a perfect score proves memorization, not understanding.
That is the answer to the question 'Why can we not evaluate a model on the same data that it was trained on?': the model is rewarded for memorizing the training data, and the score tells us nothing about its performance on unseen data.
Enter the train_test_split method in our very friendly and nifty library, scikit-learn. train_test_split splits the available data into two sets, the train and test sets, in certain proportions. In doing so, we make sure that we test the model's performance on unseen data.
Let's see the train/test split in action with a KNeighborsClassifier in scikit-learn on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# read in the iris data
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
Next, we split the data into train and test sets with random_state=4. Setting the random_state ensures reproducibility; in this case, it ensures that the records that go into the train and test sets stay the same every time our code is run.
# use train/test split with different random_state values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 4)
We then instantiate a KNeighborsClassifier with n_neighbors=9, fit the classifier on the training set, and predict on the test set.
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
# Output
0.9736842105263158
Now, set random_state to a different value. What do you think the accuracy score would be? Go ahead, change the random_state and check for yourselves; it would be a different accuracy score this time. Set random_state to yet another value and we would get yet another accuracy score. The evaluation metric obtained this way is therefore susceptible to high variance.
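To see this variance concretely, here is a small sketch, assuming the X, y, and imports defined above, that repeats the split-fit-score procedure for several random_state values; the printed accuracies will typically differ from split to split.
# repeat the train/test split with different random_state values
# and watch the accuracy score change from split to split
for seed in [1, 2, 3, 4, 5]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=9)
    knn.fit(X_train, y_train)
    print(seed, metrics.accuracy_score(y_test, knn.predict(X_test)))
Averaging the score over many different splits gives a more stable estimate, and that is exactly the idea behind K-fold cross-validation: split the data into K folds, use each fold in turn as the test set while training on the remaining K-1 folds, and average the K scores. The snippet below simulates how a dataset of 25 observations is split into 5 folds.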
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle = False).split(range(25))
# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf, start=1):
print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))
# Output
Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
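Here is a small sketch, assuming the iris X, y and the KNeighborsClassifier imported above, of how such folds can be used by hand: train on each training set, score on the matching test set, and average the fold accuracies. Shuffling is enabled because the iris records are ordered by class.
import numpy as np
# manual 5-fold cross-validation: fit and score once per fold, then average
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    knn = KNeighborsClassifier(n_neighbors=9)
    knn.fit(X[train_idx], y[train_idx])
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))
print(fold_scores, np.mean(fold_scores))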
We don't have to write this splitting-and-scoring loop ourselves; cross_val_score does this by default. Can we also use cross-validation to choose the optimal value of K, that is, to search for the optimal value of n_neighbors? (K in the KNN classifier is the number of neighbors, n_neighbors, that we take into account for predicting the class label of a test sample; it is not to be confused with the K in K-fold cross-validation.) Let's start by evaluating a single value of K with cross_val_score.
from sklearn.model_selection import cross_val_score
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
# Output
[1. 0.93333333 1. 1. 0.86666667 0.93333333
0.93333333 1. 1. 1. ]
# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())
# Output
0.9666666666666668
Now we can run 10-fold cross-validation for a range of candidate values of n_neighbors, as shown below.
# search for an optimal value of K for KNN
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
k_scores.append(scores.mean())
print(k_scores)
# Output k_scores
[0.96, 0.9533333333333334, 0.9666666666666666, 0.9666666666666666, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9800000000000001, 0.9666666666666666, 0.9666666666666666, 0.9733333333333334, 0.96, 0.9666666666666666, 0.96, 0.9666666666666666, 0.9533333333333334, 0.9533333333333334, 0.9533333333333334]
import matplotlib.pyplot as plt
%matplotlib inline
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
The plot shows that n_neighbors (K) values from 13 to 20 yield higher cross-validated accuracy, especially K = 13, 18, and 20. As a larger value of K yields a less complex model, we choose K = 20. This process of searching for the optimal values of hyperparameters is called hyperparameter tuning.
In this example, we tuned K by picking the value that resulted in the highest mean accuracy score under 10-fold cross-validation. In KNN classifiers, a very small value of K makes the model needlessly complex, and a very large value of K results in a model with high bias that yields suboptimal performance.
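As a small sketch of this selection rule, reusing the k_range and k_scores computed above, we can pick, among the K values that tie for the best mean cross-validated accuracy, the largest one:
import numpy as np
# among the K values whose mean CV accuracy ties with the best score,
# prefer the largest K, i.e. the least complex model
best_score = max(k_scores)
best_ks = [k for k, s in zip(k_range, k_scores) if np.isclose(s, best_score)]
print(best_score, best_ks, max(best_ks))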
Since K = 13, 18, and 20 gave the highest accuracy scores, close to 0.98, we chose K = 20, as a larger value of K yields a less complex model. Searching over a list of candidate values and cross-validating each one is exactly what scikit-learn's GridSearchCV automates, so let's use it next.
from sklearn.model_selection import GridSearchCV
First, we create a parameter grid (param_grid), a Python dictionary whose key is the name of the hyperparameter whose best value we're trying to find, and whose value is the list of possible values that we would like to search over for that hyperparameter.
# define the parameter values that should be searched
k_range = list(range(1, 31))
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range)
print(param_grid)
# param_grid
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
Next, we instantiate GridSearchCV. Note that we specify param_grid instead of the n_neighbors argument that we had specified for cross_val_score earlier. param_grid is a dictionary whose key is n_neighbors and whose value is the list of possible values of n_neighbors. Specifying param_grid ensures that the value at index i is fetched as the value of n_neighbors in the i-th run.
# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
After fitting the grid with the data, we can inspect the cv_results_ attribute to get the mean accuracy score after 10-fold cross-validation, its standard deviation, and the parameter values. The results for n_neighbors = 1 to 10 are shown below.
# fit the grid with data
grid.fit(X, y)
# view the results as a pandas DataFrame
import pandas as pd
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
# Output
mean_test_score std_test_score params
0 0.960000 0.053333 {'n_neighbors': 1}
1 0.953333 0.052068 {'n_neighbors': 2}
2 0.966667 0.044721 {'n_neighbors': 3}
3 0.966667 0.044721 {'n_neighbors': 4}
4 0.966667 0.044721 {'n_neighbors': 5}
5 0.966667 0.044721 {'n_neighbors': 6}
6 0.966667 0.044721 {'n_neighbors': 7}
7 0.966667 0.044721 {'n_neighbors': 8}
8 0.973333 0.032660 {'n_neighbors': 9}
9 0.966667 0.044721 {'n_neighbors': 10}
With cross_val_score, we tried eyeballing the accuracy scores to identify the best hyperparameter value, and to make that easier we plotted the hyperparameter values against the respective cross-validated accuracy scores. That works, but it is not a great option.
GridSearchCV does that bookkeeping for us and exposes best_score_, the highest cross-validated accuracy score; best_params_, the optimal values of the hyperparameters; and best_estimator_, the best model fitted with those hyperparameters.
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
# Output
0.9800000000000001
{'n_neighbors': 13}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=13, p=2,
weights='uniform')
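As a side note, here is a small sketch: since GridSearchCV refits the best estimator on the full dataset by default (refit=True), the fitted grid object can be used directly for prediction, or best_estimator_ can be pulled out and used like any other fitted classifier.
# predict a new sample through the refitted grid object ...
print(grid.predict([[5.1, 3.5, 1.4, 0.2]]))
# ... or, equivalently, through the underlying best estimator
print(grid.best_estimator_.predict([[5.1, 3.5, 1.4, 0.2]]))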
K = 13 has been chosen; remember, K = 13 was one of the values of K that gave the highest cross-validated accuracy score. So far we have tuned a single hyperparameter, n_neighbors. What if there were many such hyperparameters?
We may think, “Why not tune each hyperparameter independently?”
We could, but the best value of one hyperparameter can depend on the values of the others, so it is safer to search over combinations. In addition to n_neighbors, let's search for the optimal weighting strategy as well. The weights parameter accepts 'uniform', where all points are weighted equally, and 'distance', which weights points by the inverse of their distance.
# define the parameter values that should be searched
k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range, weights=weight_options)
print(param_grid)
# param_grid
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
'weights': ['uniform', 'distance']}
# instantiate and fit the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid.fit(X, y)
# view the results
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
# Results
mean_test_score std_test_score params
0 0.960000 0.053333 {'n_neighbors': 1, 'weights': 'uniform'}
1 0.960000 0.053333 {'n_neighbors': 1, 'weights': 'distance'}
2 0.953333 0.052068 {'n_neighbors': 2, 'weights': 'uniform'}
3 0.960000 0.053333 {'n_neighbors': 2, 'weights': 'distance'}
4 0.966667 0.044721 {'n_neighbors': 3, 'weights': 'uniform'}
5 0.966667 0.044721 {'n_neighbors': 3, 'weights': 'distance'}
6 0.966667 0.044721 {'n_neighbors': 4, 'weights': 'uniform'}
7 0.966667 0.044721 {'n_neighbors': 4, 'weights': 'distance'}
8 0.966667 0.044721 {'n_neighbors': 5, 'weights': 'uniform'}
9 0.966667 0.044721 {'n_neighbors': 5, 'weights': 'distance'}
10 0.966667 0.044721 {'n_neighbors': 6, 'weights': 'uniform'}
11 0.966667 0.044721 {'n_neighbors': 6, 'weights': 'distance'}
12 0.966667 0.044721 {'n_neighbors': 7, 'weights': 'uniform'}
13 0.966667 0.044721 {'n_neighbors': 7, 'weights': 'distance'}
14 0.966667 0.044721 {'n_neighbors': 8, 'weights': 'uniform'}
15 0.966667 0.044721 {'n_neighbors': 8, 'weights': 'distance'}
16 0.973333 0.032660 {'n_neighbors': 9, 'weights': 'uniform'}
17 0.973333 0.032660 {'n_neighbors': 9, 'weights': 'distance'}
18 0.966667 0.044721 {'n_neighbors': 10, 'weights': 'uniform'}
(The output above is truncated; there are 60 rows in total, since there are 30 possible values for n_neighbors and 2 possible values for weights.)
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
# best score and best parameters
0.9800000000000001
{'n_neighbors': 13, 'weights': 'uniform'}
The best model uses n_neighbors = 13 and weights = 'uniform'. Now, suppose we have to tune 4 hyperparameters and we have a list of 10 possible values for each of them. This process creates 10*10*10*10 = 10,000 candidate models, and with 10-fold cross-validation that means 100,000 rounds of training and prediction. Clearly, things scale up very quickly and can soon become computationally infeasible.
To see why, suppose the model has M hyperparameters; let p_1, p_2, p_3, …, p_M be those hyperparameters, and let the number of candidate values be n_1 for p_1, n_2 for p_2, and so on, up to n_M values for p_M.
Out of the M hyperparameters, suppose we freeze the values of all but one, say the M-th hyperparameter p_M. Grid Search then works through the list of n_M candidate values for p_M, and n_M models are created.
Now unfreeze one more hyperparameter, so that both p_M and p_(M-1) vary. We have to search through all possible combinations of p_M and p_(M-1), which have n_M and n_(M-1) candidate values respectively: for each of the n_(M-1) values of p_(M-1), we again search through all n_M values of p_M. This process leaves us with n_(M-1) * n_M models.
Extending the same argument to all M hyperparameters, we end up with n_1 * n_2 * n_3 * … * n_M models. This is why we said that things can scale up quickly and become computationally intractable with Grid Search.
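As a quick sketch of this arithmetic (the hypothetical_grid dictionary below is made up purely for illustration), the total number of models an exhaustive search fits is the product of the number of candidate values per hyperparameter, multiplied again by the number of cross-validation folds:
import math
# hypothetical example: 4 hyperparameters with 10 candidate values each
hypothetical_grid = dict(p1=range(10), p2=range(10), p3=range(10), p4=range(10))
n_combinations = math.prod(len(values) for values in hypothetical_grid.values())
print(n_combinations)        # 10000 parameter combinations
print(n_combinations * 10)   # 100000 fits under 10-fold cross-validation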
Randomized Search offers a way out. Unlike GridSearchCV, not all parameter values are tried out in RandomizedSearchCV; rather, a fixed number of parameter settings is sampled from the specified distributions or lists of parameter values. That number is given by n_iter, and there is a quality versus computational cost trade-off in picking it.
A very small value of n_iter means we are more likely to end up with a suboptimal solution, because we consider too few combinations.
A very high value of n_iter means we can, ideally, get closer to the hyperparameters that yield the best model, but this again comes with a high computational cost, as before.
In fact, if we set n_iter = n_1*n_2*n_3*…*n_M from the previous example, then we are essentially considering all possible hyperparameter combinations, and Randomized Search and Grid Search become equivalent.
As before, we tune both n_neighbors and weights. Now, let us implement Randomized Search in scikit-learn, following the same steps as we did for Grid Search.
from sklearn.model_selection import RandomizedSearchCV
# specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=k_range, weights=weight_options)
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5, return_train_score=False)
rand.fit(X, y)
pd.DataFrame(rand.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
#DataFrame
mean_test_score std_test_score params
0 0.973333 0.032660 {'weights': 'distance', 'n_neighbors': 16}
1 0.966667 0.033333 {'weights': 'uniform', 'n_neighbors': 22}
2 0.980000 0.030551 {'weights': 'uniform', 'n_neighbors': 18}
3 0.966667 0.044721 {'weights': 'uniform', 'n_neighbors': 27}
4 0.953333 0.042687 {'weights': 'uniform', 'n_neighbors': 29}
5 0.973333 0.032660 {'weights': 'distance', 'n_neighbors': 10}
6 0.966667 0.044721 {'weights': 'distance', 'n_neighbors': 22}
7 0.973333 0.044222 {'weights': 'uniform', 'n_neighbors': 14}
8 0.973333 0.044222 {'weights': 'distance', 'n_neighbors': 12}
9 0.973333 0.032660 {'weights': 'uniform', 'n_neighbors': 15}
# examine the best model
print(rand.best_score_)
print(rand.best_params_)
# Output
0.9800000000000001
{'weights': 'uniform', 'n_neighbors': 18}
Randomized Search landed on n_neighbors = 18, which is also one of the optimal values we got when we initially searched for the best value of n_neighbors. Maybe we just got lucky?
What is the guarantee that we will always get the best results?
Ah, this question makes perfect sense, doesn't it?
Let's run RandomizedSearchCV multiple times and see how often we really end up getting lucky!
# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for _ in range(20):
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, return_train_score=False)
rand.fit(X, y)
best_scores.append(round(rand.best_score_, 3))
Let us now examine all 20 best scores.
print(best_scores)
# Output: Best Scores
[0.973, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.973, 0.98, 0.973, 0.973, 0.98, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.98, 0.973]
This observation convinces us that even though Randomized Search may not always find the hyperparameters of the best-performing model, the models obtained with the hyperparameters it does find do not perform much worse than the best model obtained from Grid Search.
In essence, these may not be the best hyperparameters, but they are certainly close to the best, and they are found with far less computation.
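As a closing sketch (timings depend on your machine; the comparison reuses knn, param_grid, and param_dist from above), we can make that cost difference concrete by timing the exhaustive GridSearchCV against RandomizedSearchCV with n_iter=10 on the same n_neighbors/weights search space:
import time
# time an exhaustive grid search vs. a 10-iteration randomized search
for name, search in [
    ('GridSearchCV', GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')),
    ('RandomizedSearchCV', RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)),
]:
    start = time.time()
    search.fit(X, y)
    print(name, round(time.time() - start, 2), 'seconds, best score:', search.best_score_)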