- Learning Data Mining with Python (Second Edition)
- Robert Layton
Testing the algorithm
When we evaluated the affinity analysis algorithm in the earlier section, our aim was to explore the current dataset. With classification, our problem is different: we want to build a model that will allow us to classify previously unseen samples by comparing them to what we know about the problem.
For this reason, we split our machine-learning workflow into two stages: training and testing. In training, we take a portion of the dataset and create our model. In testing, we apply that model to the held-out portion and evaluate how effectively it performs there. As our goal is to create a model that can classify previously unseen samples, we cannot use our testing data for training the model. If we do, we run the risk of overfitting.
Overfitting is the problem of creating a model that classifies our training dataset very well but performs poorly on new samples. The solution is quite simple: never use training data to test your algorithm. This simple rule has some complex variants, which we will cover in later chapters; but, for now, we can evaluate our OneR implementation by simply splitting our dataset into two small datasets: a training one and a testing one. This workflow is given in this section.
The scikit-learn library contains a function to split data into training and testing components:
from sklearn.model_selection import train_test_split  # older scikit-learn versions placed this in sklearn.cross_validation
This function splits the dataset into two sub-datasets according to a given ratio (by default, 25 percent of the dataset is used for testing). It does this randomly, which improves the confidence that the algorithm will perform as expected in real-world environments (where we expect data to arrive from a random distribution):
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y,
                                                       random_state=14)
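The call above relies on the default split ratio. If you want a different proportion held out for testing, train_test_split accepts an explicit test_size argument; the value of 0.25 below simply restates the default and is shown only for illustration:
# Hold out 25 percent of the samples for testing (the default ratio)
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y,
                                                       test_size=0.25,
                                                       random_state=14)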
We now have two smaller datasets: Xd_train contains our data for training and Xd_test contains our data for testing. y_train and y_test give the corresponding class values for these datasets.
We also specify a random_state. Setting the random state will give the same split every time the same value is entered. It will look random, but the algorithm used is deterministic, and the output will be consistent. For this book, I recommend setting the random state to the same value that I do, as it will give you the same results that I get, allowing you to verify your results. To get truly random results that change every time you run it, set random_state to None.
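A quick sanity check, not part of the book's code, confirms both points: roughly a quarter of the samples end up in the testing set, and repeating the split with the same random_state reproduces it exactly (this assumes X_d and y are still loaded from earlier in the chapter):
import numpy as np

print(Xd_train.shape, Xd_test.shape)  # about 75 percent / 25 percent of the rows

# Splitting again with the same random_state gives an identical split
Xd_train2, Xd_test2, y_train2, y_test2 = train_test_split(X_d, y, random_state=14)
print(np.array_equal(Xd_train, Xd_train2))  # True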
Next, we compute the predictors for all the features for our dataset. Remember to only use the training data for this process. We iterate over all the features in the dataset and use our previously defined functions to train the predictors and compute the errors:
all_predictors = {}
errors = {}
for feature_index in range(Xd_train.shape[1]):
    predictors, total_error = train(Xd_train, y_train, feature_index)
    all_predictors[feature_index] = predictors
    errors[feature_index] = total_error
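Before choosing a winner, it can be instructive to look at the error each feature obtained. The short loop below is not in the original text, but it only reads the errors dictionary we just built:
# Print the total error achieved by each feature's rule
for feature_index in sorted(errors):
    print("Feature {} has error {}".format(feature_index, errors[feature_index]))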
Next, we find the best feature to use as our One Rule, by finding the feature with the lowest error:
from operator import itemgetter  # may already be imported from earlier in the chapter
best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]
We then create our model by storing the predictors for the best feature:
model = {'feature': best_feature,
         'predictor': all_predictors[best_feature]}
Our model is a dictionary that tells us which feature to use for our One Rule and the predictions that are made based on the values it has. Given this model, we can predict the class of a previously unseen sample by finding the value of the specific feature and using the appropriate predictor. The following code does this for a given sample:
variable = model['feature']
predictor = model['predictor']
prediction = predictor[int(sample[variable])]
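As a concrete example, we can run those three lines against the first row of the testing set; picking Xd_test[0] is purely for illustration and is not something the book itself does:
# Predict the class of the first held-out sample
sample = Xd_test[0]
prediction = predictor[int(sample[variable])]
print(prediction)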
Often we want to predict several new samples at a time, which we can do using the following function. It simply uses the above code, but iterates over all the samples in a dataset, obtaining the prediction for each sample:
def predict(X_test, model):
    variable = model['feature']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])]
                            for sample in X_test])
    return y_predicted
For our testing dataset, we get the predictions by calling this function:
y_predicted = predict(Xd_test, model)
We can then compute the accuracy of these predictions by comparing them to the known classes:
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))
This algorithm gives an accuracy of 65.8 percent, which is not bad for a single rule!
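To put that figure in context, a simple baseline is to always predict the most frequent class from the training data; OneR is only doing useful work if it beats that. The comparison below is not part of the book's code, but it uses only the arrays we already have:
from collections import Counter

# Accuracy of always predicting the most common class seen in training
most_common_class = Counter(y_train).most_common(1)[0][0]
baseline_accuracy = np.mean(y_test == most_common_class) * 100
print("The baseline accuracy is {:.1f}%".format(baseline_accuracy))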