- Learning Data Mining with Python (Second Edition)
- Robert Layton
Engineering new features
In the previous few examples, we saw that changing the features can have quite a large impact on the performance of the algorithm. Even in our small amount of testing, the choice of features alone changed the results by more than 10 percent.
You can create features that come from a simple function in pandas by doing something like this:
dataset["New Feature"] = feature_creator()
The feature_creator function must return a list containing the feature's value for each sample in the dataset, in the same order as the rows. A common pattern is to use the dataset as a parameter:
dataset["New Feature"] = feature_creator(dataset)
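As a minimal sketch of this pattern, the following computes a point margin for each game. The toy data, the `HomePts` and `VisitorPts` column names, and the `point_margin` function are assumptions made for illustration, not something defined in this section:

```python
import pandas as pd

# Toy data for illustration only. "HomePts" and "VisitorPts" are
# assumed column names; adjust them to match your own dataset.
dataset = pd.DataFrame({
    "Home Team": ["Hawks", "Bulls", "Heat"],
    "Visitor Team": ["Bulls", "Heat", "Hawks"],
    "HomePts": [101, 95, 110],
    "VisitorPts": [99, 102, 110],
})

def point_margin(df):
    """Return one value per row: home points minus visitor points."""
    return (df["HomePts"] - df["VisitorPts"]).tolist()

# The returned list has one entry per row, in row order.
dataset["Point Margin"] = point_margin(dataset)
print(dataset)
```

Because the function receives the whole dataset, it can use column-wise operations instead of looping over rows.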
You can create those features more directly by setting all the values to a single default value, like 0 in the next line:
dataset["My New Feature"] = 0
You can then iterate over the dataset, computing the features as you go. We used this format in this chapter to create many of our features:
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Some calculation here to compute feature_value for this row
    # Write the computed value into this row's cell
    dataset.at[index, "FeatureName"] = feature_value
Keep in mind that this pattern isn't very efficient, because iterating over rows one at a time in Python is slow. If you are going to do this, compute all of your features in a single pass.
A common best practice is to touch every sample as little as possible, preferably only once.
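One way to follow this advice is to collect several features in a single loop and assign each column once at the end. The sketch below assumes a boolean `HomeWin` column and rows already in chronological order; the toy data and the list names are illustrative only:

```python
from collections import defaultdict

import pandas as pd

# Toy data for illustration only. "HomeWin" is an assumed boolean
# column, and the rows are assumed to be in chronological order.
dataset = pd.DataFrame({
    "Home Team": ["A", "B", "A", "C"],
    "Visitor Team": ["B", "C", "C", "A"],
    "HomeWin": [True, False, True, False],
})

won_last = defaultdict(bool)  # team -> did it win its previous game?
home_last_win = []
visitor_last_win = []

# A single pass over the rows computes both features.
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Record what we knew before this game was played.
    home_last_win.append(won_last[home_team])
    visitor_last_win.append(won_last[visitor_team])
    # Update the lookup with this game's result.
    home_won = bool(row["HomeWin"])
    won_last[home_team] = home_won
    won_last[visitor_team] = not home_won

# Assign each column once, rather than writing cell by cell.
dataset["HomeLastWin"] = home_last_win
dataset["VisitorLastWin"] = visitor_last_win
```

Building plain Python lists and assigning them as whole columns avoids repeated per-cell writes to the DataFrame, while each sample is still touched only once.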
Some example features that you could try and implement are as follows:
- How many days has it been since each team's previous match? Teams may be tired if they play too many games in a short time frame.
- How many games of the last five did each team win? This will give a more stable form of the HomeLastWin and VisitorLastWin features we extracted earlier (and can be extracted in a very similar way).
- Do teams have a good record when visiting certain other teams? For instance, one team may play well in a particular stadium, even if they are the visitors.
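The first of these ideas could be sketched as follows. This assumes a `Date` column of datetime values and rows sorted by date; the toy data, the `days_since` helper, and the new column names are hypothetical:

```python
import pandas as pd

# Toy data for illustration only. "Date" is an assumed column name,
# and the rows are assumed to be sorted chronologically.
dataset = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2015-10-27", "2015-10-28", "2015-10-30", "2015-11-02"]
    ),
    "Home Team": ["A", "B", "A", "C"],
    "Visitor Team": ["B", "C", "C", "A"],
})

last_played = {}  # team -> date of its most recent game

def days_since(team, date):
    """Days since the team's previous game, or None for its first game."""
    previous = last_played.get(team)
    return None if previous is None else (date - previous).days

home_rest = []
visitor_rest = []
for index, row in dataset.iterrows():
    date = row["Date"]
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    home_rest.append(days_since(home_team, date))
    visitor_rest.append(days_since(visitor_team, date))
    # Both teams played today, so update their last-played dates.
    last_played[home_team] = date
    last_played[visitor_team] = date

# Missing values (a team's first game) become NaN in the DataFrame.
dataset["HomeDaysRest"] = home_rest
dataset["VisitorDaysRest"] = visitor_rest
```

A team's first game in the dataset has no previous match, so you would need to decide how to fill those missing values before passing the feature to a classifier.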
If you have trouble extracting features of these types, check the pandas documentation at http://pandas.pydata.org/pandas-docs/stable/ for help. Alternatively, you can ask on an online forum such as Stack Overflow.
More extreme examples could use player data to estimate the strength of each team's side and predict who won. These types of complex features are used every day by gamblers and sports betting agencies to try to turn a profit by predicting the outcome of sports matches.