Setting parameters

Almost all parameters can be set by the user, letting algorithms adapt to the specific dataset at hand, rather than only being applicable across a small and specific range of problems. Setting these parameters can be quite difficult, as choosing good parameter values often relies heavily on features of the dataset.

The nearest neighbor algorithm has several parameters, but the most important one is the number of nearest neighbors to use when predicting the class of an unseen sample. In scikit-learn, this parameter is called n_neighbors. In the following figure, we show that when this number is too low, a randomly labeled sample can cause an error. In contrast, when it is too high, the actual nearest neighbors have less effect on the result:

In figure (a), on the left-hand side, we would usually expect to classify the test sample (the triangle) as a circle. However, if n_neighbors is 1, the single red diamond in this area (likely a noisy sample) causes the sample to be predicted as a diamond. In figure (b), on the right-hand side, we would usually expect to classify the test sample as a diamond. However, if n_neighbors is 7, the three nearest neighbors (which are all diamonds) are outvoted by a larger number of circle samples. Choosing this parameter can therefore be a difficult problem, as its value can make a huge difference. Luckily, most of the time the specific parameter value does not greatly affect the end result, and the standard values (usually 5 or 10) are often close enough.
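We can reproduce this effect in code with a small synthetic dataset (the data points below are hypothetical, chosen to mirror the figure): a cluster of circles (class 0), a distant cluster of diamonds (class 1), and one noisy diamond sitting inside the circle cluster.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy dataset mirroring figure (a): four "circle" samples
# (class 0) in one cluster, four "diamond" samples (class 1) far away,
# plus one noisy diamond inside the circle cluster.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],   # circles
              [5, 5], [6, 5], [5, 6], [6, 6],   # diamonds
              [0.5, 0.5]])                      # noisy diamond
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

test_sample = np.array([[0.4, 0.5]])  # lies inside the circle cluster

# With n_neighbors=1, the single closest point is the noisy diamond,
# so the test sample is (wrongly) predicted as a diamond.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn1.predict(test_sample))  # [1]

# With n_neighbors=5, the four nearby circles outvote the noisy diamond.
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn5.predict(test_sample))  # [0]
```

The same sample flips from the wrong class to the right one purely by changing n_neighbors, which is exactly the sensitivity the figure illustrates.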

With that in mind, we can test out a range of values, and investigate the impact that this parameter has on performance. If we want to test a number of values for the n_neighbors parameter, for example, each of the values from 1 to 20, we can rerun the experiment many times by setting n_neighbors and observing the result. The code below does this, storing the values in the avg_scores and all_scores variables.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

avg_scores = []
all_scores = []
parameter_values = list(range(1, 21))  # Include 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring='accuracy')
    avg_scores.append(np.mean(scores))
    all_scores.append(scores)

We can then plot the relationship between the value of n_neighbors and the accuracy. First, we tell the Jupyter Notebook that we want to show plots inline in the notebook itself:

%matplotlib inline

We then import pyplot from the matplotlib library and plot the parameter values alongside average scores:

from matplotlib import pyplot as plt
plt.plot(parameter_values, avg_scores, '-o')

While there is a lot of variance, the plot shows a decreasing trend as the number of neighbors increases. You can expect large amounts of variance whenever you perform evaluations of this nature. To compensate, update the code to run 100 tests per value of n_neighbors.
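One way to sketch those repeated tests is to reshuffle the cross-validation splits on each run and average the results. This is one reasonable interpretation, not the only one; here the Iris dataset stands in for whatever X and y you are using, and n_runs is a hypothetical name for the repeat count:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset (assumption)

n_runs = 100  # number of repeated evaluations per parameter value
parameter_values = list(range(1, 21))
avg_scores = []
for n_neighbors in parameter_values:
    run_means = []
    for run in range(n_runs):
        # Reshuffle the folds each run so the scores actually vary
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=run)
        estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
        scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=cv)
        run_means.append(np.mean(scores))
    # Average over all runs to smooth out the fold-to-fold variance
    avg_scores.append(np.mean(run_means))
```

Averaging 100 reshuffled evaluations per value smooths the curve considerably, making the underlying trend easier to see when you re-plot avg_scores.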