Running the algorithm

The previous results are quite good, but they are based on a single testing set. What happens if we get lucky and choose an easy testing set? Alternatively, what if it is a particularly troublesome one? We could end up discarding a good model because of poor results caused by such an unlucky split of our data.

The cross-fold validation framework addresses the problem of choosing a single testing set and is a standard best-practice methodology in data mining. The process works by running many experiments with different training and testing splits, but using each sample in a testing set only once. The procedure is as follows:

  1. Split the entire dataset into several sections called folds.
  2. For each fold in the data, execute the following steps:
    1. Set that fold aside as the current testing set.
    2. Train the algorithm on the remaining folds.
    3. Evaluate on the current testing set.
  3. Report on all the evaluation scores, including the average score.

In this process, each sample is used in the testing set only once, reducing (but not eliminating) the likelihood of choosing lucky testing sets.
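To make those steps concrete, the following sketch runs the loop by hand using scikit-learn's KFold splitter. It assumes that estimator is the model created earlier in the notebook and that X and y are NumPy arrays holding the dataset and its labels; treat it as an illustration of the procedure only, as the helper function introduced next does all of this for us.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
scores = []
for train_index, test_index in kf.split(X):
    # Set the current fold aside as the testing set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train a fresh copy of the algorithm on the remaining folds
    fold_estimator = clone(estimator)
    fold_estimator.fit(X_train, y_train)
    # Evaluate on the current testing set
    scores.append(fold_estimator.score(X_test, y_test))
# Report all of the evaluation scores, including the average score
print(scores)
print("Average score: {0:.1f}%".format(np.mean(scores) * 100))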

Throughout this book, the code examples build upon each other within a chapter. Each chapter's code should be entered into the same Jupyter Notebook unless otherwise specified in-text.

The scikit-learn library contains a few cross-fold validation methods. It also provides a helper function that performs the preceding procedure. We can import it now in our Jupyter Notebook:

# in older versions of scikit-learn this lived in sklearn.cross_validation
from sklearn.model_selection import cross_val_score

By default, cross_val_score uses a specific methodology called Stratified K-Fold to create folds that have approximately the same proportion of classes in each fold, again reducing the likelihood of choosing poor folds. Stratified K-Fold is a great default, so we won't mess with it right now.

Next, we use this new function to evaluate our model using cross-fold validation:

# Perform cross-fold validation and report the average accuracy as a percentage
scores = cross_val_score(estimator, X, y, scoring='accuracy')
average_accuracy = np.mean(scores) * 100
print("The average accuracy is {0:.1f}%".format(average_accuracy))

Our new code returns a slightly more modest result of 82.3 percent, but it is still quite good considering we have not yet tried setting better parameters. In the next section, we will see how we would go about changing the parameters to achieve a better outcome.
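Before we get there, it is worth knowing that cross_val_score accepts a cv parameter, which lets us build the splitter ourselves rather than relying on the Stratified K-Fold default. The following sketch shows the idea, again assuming the estimator, X, and y defined earlier in the notebook; it should give the same kind of result as the default call above.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Create the Stratified K-Fold splitter explicitly and pass it to
# cross_val_score through the cv parameter
skf = StratifiedKFold(n_splits=5)
explicit_scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=skf)
print("The average accuracy is {0:.1f}%".format(np.mean(explicit_scores) * 100))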

It is quite natural for results to vary when performing data mining and attempting to repeat experiments. This is due to variations in how the folds are created and the randomness inherent in some classification algorithms. We can deliberately choose to replicate an experiment exactly by setting the random state (which we will do in later chapters). In practice, it's a good idea to rerun experiments multiple times to get a sense of the average result and the spread of the results (the mean and standard deviation) across all experiments.
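One simple way to measure that spread is to repeat the cross-fold validation several times, shuffling the data differently on each run, and then report the mean and standard deviation of the per-run results. The sketch below illustrates the idea, assuming the same estimator, X, and y as before; the exact numbers you see will depend on your data and model.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Repeat the experiment with a different shuffle each time by
# changing the random state of the splitter
run_accuracies = []
for random_state in range(10):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=skf)
    run_accuracies.append(np.mean(scores))
# Report the average result and the spread across all experiments
print("Mean accuracy: {0:.1f}%".format(np.mean(run_accuracies) * 100))
print("Standard deviation: {0:.1f}%".format(np.std(run_accuracies) * 100))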