Selecting variables using the Means node
In this recipe we will identify and select variables to include as model inputs using the Means node.
Getting ready
This recipe uses the datafile cup98lrn_reduced_vars3.sav and the stream recipe_variableselection_means.str.
A copy of Microsoft Excel is optional; it is useful for sorting the exported report numerically.
How to do it...
To identify and select variables to include as model inputs using the Means node:
- Open the stream variableselection_means.str by navigating to File | Open Stream.
- Make sure the datafile points to the correct path to the file cup98lrn_reduced_vars3.sav.
- Open the Means node to look at the options. Note that the grouping variable is our target variable, TARGET_B, and the test fields are all the continuous variables of interest, as shown in the following figure.
- Run the Means node by clicking on Run.
- Inside the output window, click on the Importance column twice so that the variables are sorted in descending order of Importance, as shown in the following screenshot.
- Identify variables whose Importance score is greater than 0.9. These are good candidates to retain as inputs for your models.
- Open the Type node, MEANS types. Press Ctrl + A to select all fields, left-click on any variable's Role value, and select None. For TARGET_B, change the Role to Target, and for every variable identified in the previous step, select Input as the role. Note that you can keep both the Means node output and the Type node open at the same time.
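The role assignment in the last step can be sketched in plain Python, outside Modeler. This is only an illustration of the selection logic: the Importance scores are typed in by hand, FIELD_A and FIELD_B are placeholder names, all scores are made up, and the function assign_roles is a hypothetical helper rather than part of any Modeler API.

```python
# Hypothetical Importance scores, as they might be read off a Means node output.
# Only CARDPROM and TARGET_B come from the recipe; FIELD_A, FIELD_B, and every
# score below are placeholders.
importance = {
    "CARDPROM": 0.999,
    "FIELD_A": 0.95,
    "FIELD_B": 0.42,
}


def assign_roles(importance, target, cutoff=0.9):
    """Return a field -> role mapping: Input above the cutoff, None otherwise."""
    roles = {
        field: ("Input" if score > cutoff else "None")
        for field, score in importance.items()
    }
    roles[target] = "Target"  # the grouping variable becomes the model target
    return roles


roles = assign_roles(importance, target="TARGET_B")
for field, role in sorted(roles.items()):
    print(f"{field:10s} {role}")
```

The resulting mapping mirrors what is set by hand in the Type node: everything starts as None, the target is marked, and only fields above the cutoff become inputs.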
How it works...
The Means node is an excellent way to examine the differences on average between groups for a Nominal, Ordinal, or Flag target variable. When examining the difference of means based on a grouping variable in the Means node, Modeler generates an F statistic for each continuous (or Flag) test field and computes the associated significance value (p-value). Importance in Modeler is 1 minus this significance value, so a value near 1.0 represents highly significant differences in the mean values and a value near 0.0 represents no evidence of a difference in the mean values between groups.
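The calculation described above can be sketched with the standard library alone. The sketch assumes that Importance is 1 minus the p-value of a one-way ANOVA F-test; the function names and the toy data are hypothetical, and the code is not taken from Modeler. The p-value comes from the regularized incomplete beta function, evaluated here with a textbook continued fraction.

```python
import math


def _betacf(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def f_sf(f_stat, df1, df2):
    """Upper-tail probability (p-value) of the F distribution."""
    return _betai(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f_stat))


def anova_importance(groups):
    """Return (F statistic, 1 - p) for a list of groups of numeric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df1, df2 = k - 1, n - k
    f_stat = (ss_between / df1) / (ss_within / df2)
    return f_stat, 1.0 - f_sf(f_stat, df1, df2)


# Toy data: values of one test field, split by the two values of a flag target.
f_stat, importance = anova_importance([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
print(f"F = {f_stat:.3f}, importance = {importance:.3f}")
```

With only three records per group, the toy example yields a modest score, well below the 0.9 cutoff used in this recipe.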
The mean values for each test field are shown in the columns, one column for each value of the target variable. For a Flag variable, there will be two values. For TARGET_B, the way to interpret the results is: for records where TARGET_B has the value 1, the average CARDPROM value is 19.64, whereas for records where TARGET_B has the value 0, the average CARDPROM value is 18.371.
There is no single right value to use as a cutoff indicating which variables are good or not. A value of 0.9 is a conservative cutoff. Note that the more records one has, the higher the Importance score tends to become. As a result, large datasets can show high Importance scores even when the difference in mean values is quite small. If this is the case, one can increase the Importance cutoff to 0.95 or even 0.99.
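The effect of record count on the score can be illustrated with a rough calculation. The sketch below keeps the CARDPROM mean difference from this recipe fixed and varies the number of records per group. It uses a large-sample normal approximation to the two-group test instead of the exact F-test, and the standard deviation of 8.0 is an assumed value, not one taken from the dataset.

```python
import math


def approx_importance(mean_diff, sd, n_per_group):
    """Approximate 1 - p for two equal-sized groups with a common SD.

    Uses a two-sided normal approximation, so it is only a rough guide
    to what an exact F-test would report.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    p_value = math.erfc(z / math.sqrt(2.0))  # two-sided tail probability
    return 1.0 - p_value


mean_diff = 19.64 - 18.371  # CARDPROM means for TARGET_B = 1 and TARGET_B = 0
assumed_sd = 8.0            # hypothetical standard deviation

sizes = [50, 500, 5000]
scores = [approx_importance(mean_diff, assumed_sd, n) for n in sizes]
for n, score in zip(sizes, scores):
    print(f"n per group = {n:5d}  approximate importance = {score:.4f}")
```

Under these assumptions the same difference in means falls below the 0.9 cutoff with 50 records per group and clears it with 500, which is why a stricter cutoff is reasonable for large datasets.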
There's more...
More information about the F-test can be seen by navigating to the View | Advanced report setting. In this report, the four values for each variable are the mean value, the standard deviation, the standard error (for the mean value), and the record count. The report also shows the F-Test's F statistic alongside the Importance score shown in the simple report.
The F statistic itself is displayed under the View | Advanced option (refer to the following screenshot). Unfortunately, as shown in the screenshot, sorting by the F-Test value does not sort numerically in all versions of Clementine and Modeler. Some versions sort by ASCII character value, so all values with a leading 9 will be at the top. To see a true numerically sorted list, one can export the report by navigating to File | Export HTML, load the report into Excel, and sort it there.
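The difference between the two sort orders is easy to reproduce. In this short sketch the F statistic values are made up and held as text, which is how they would arrive from an exported report; sorting the strings directly mimics the character-based sort, while converting to numbers first gives the intended order.

```python
# Hypothetical F statistic values, stored as text as they would be in an export.
f_values = ["9.5", "123.4", "87.1", "1045.2"]

# Character-based descending sort: compares digit by digit, so "9.5" comes first.
text_sorted = sorted(f_values, reverse=True)

# Numeric descending sort: convert each value to a float before comparing.
numeric_sorted = sorted(f_values, key=float, reverse=True)

print("Text sort:   ", text_sorted)
print("Numeric sort:", numeric_sorted)
```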
As with the correlation matrix variable selection, selecting or removing a large number of variables may be tedious and prone to error, so writing a CLEM script to customize the Type node or Filter node can help.
See also
- The Selecting variables using the CHAID Modeling Node and Selecting variables using single-antecedent Association Rules recipes in this chapter