- Learning Data Mining with Python (Second Edition)
- Robert Layton
Applying random forests
Random forests in scikit-learn use the Estimator interface, allowing us to use almost the exact same code as before to do cross-fold validation:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
This gives an immediate accuracy of 65.3 percent, an improvement of 2.5 points just from swapping in the new classifier.
Random forests, using subsets of the features, should be able to learn more effectively with more features than normal decision trees. We can test this by throwing more features at the algorithm and seeing how it goes:
# Combine the last-winner features with the team features
X_all = np.hstack([X_lastwinner, X_teams])
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
This results in 63.3 percent, a drop in performance! One cause is the randomness inherent in random forests: each split considers only a subset of the features rather than all of them, so relevant features can be passed over. Further, there are many more features in X_teams than in X_lastwinner, and these extra features mean that less relevant information gets used. That said, don't get too excited by small changes in percentages, either up or down. Changing the random state value will have more of an impact on the accuracy than the slight difference between these feature sets that we just observed. Instead, you should run many tests with different random states to get a good sense of the mean and spread of the accuracy values.
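As a minimal sketch of this idea (assuming X_teams and y_true are defined as above), we can loop over several random states and summarize the resulting accuracies:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Repeat the cross-validated evaluation for several random states and
# report the mean and spread of the accuracy values.
accuracies = []
for state in range(10):
    clf = RandomForestClassifier(random_state=state)
    scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
    accuracies.append(np.mean(scores))
print("Mean accuracy: {0:.1f}%".format(np.mean(accuracies) * 100))
print("Standard deviation: {0:.1f}%".format(np.std(accuracies) * 100))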
We can also try some other parameters using the GridSearchCV class, as we introduced in Chapter 2, Classifying using scikit-learn Estimators:
from sklearn.model_selection import GridSearchCV
parameter_space = {
"max_features": [2, 10, 'auto'],
"n_estimators": [100, 200],
"criterion": ["gini", "entropy"],
"min_samples_leaf": [2, 4, 6],
}
clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))
This gives a much better accuracy of 67.4 percent!
If we want to see the parameters used, we can print out the best model found in the grid search. The code is as follows:
print(grid.best_estimator_)
The result shows the parameters that were used in the best scoring model:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features=2, max_leaf_nodes=None,
min_samples_leaf=2, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=14, verbose=0, warm_start=False)
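If we are only interested in the searched parameters rather than the full estimator, the grid search object also exposes them through its best_params_ attribute; for example:
# Show only the parameters from the search space used by the best model
print(grid.best_params_)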