Selecting variables using the CHAID Modeling node

In this recipe we will identify and select variables to include as model inputs using the CHAID node.

You will need a copy of Microsoft Excel to visualize and select the chi-square values for each variable.

Getting ready

This recipe uses the datafile cup98lrn_reduced_vars3.sav and the stream recipe_variableselection_chaid.str.

How to do it...

To identify and select variables to include as model inputs using the CHAID node:

  1. Open the stream recipe_variableselection_chaid.str by navigating to File | Open Stream and selecting the stream.
  2. Make sure the datafile points to the correct path for the file cup98lrn_reduced_vars3.sav.
  3. Open the Type node named CHAID Types. Notice that there are several variables of type continuous whose direction values have been set to Input, and a single continuous variable has its direction set to Target. The variable set to Target should be the target variable TARGET_B.
  4. Open the node TARGET_B and select the Interactive Model option.
  5. Begin to build the CHAID model by clicking on the Run button.
  6. When the interactive model split appears, click on the Predictors… button to list the chi-square statistic for every field, ordered from the highest to the lowest value.

    The list shows each predictor together with its associated probability (p-value).

  7. Click on any field in the list and press Ctrl + A to select all the variables in the list. Copy the selected variables with Ctrl + C. Open Microsoft Excel, create a new workbook, and paste the clipboard contents into it. This provides an easier way to identify which fields to keep.
  8. Identify all fields whose chi-square statistic values have a p-value greater than 0.05. These are good candidates to remove.
  9. In the Modeler stream, connect a Type node to the right of the CHAID Types node. Double-click on the new Type node, and set the direction of the fields identified in step 8 to None. As an alternative, one may use a Filter node to remove the fields identified in step 8.
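
If you prefer not to scan the pasted list by eye, the filtering in step 8 can be sketched in a few lines of Python. This is only an illustration: the column headings (Field, Chi-square, df, P-value), the field names, and the values below are all invented stand-ins, so adjust them to match whatever your copy of the Predictors list actually contains.

```python
import csv
import io

# Hypothetical tab-separated text standing in for the list copied from the
# Predictors dialog. Column headings, field names, and values are invented.
PASTED = (
    "Field\tChi-square\tdf\tP-value\n"
    "FIELD_A\t152.3\t4\t0.0000\n"
    "FIELD_B\t41.7\t3\t0.0000\n"
    "FIELD_C\t5.2\t2\t0.0743\n"
    "FIELD_D\t1.1\t1\t0.2943\n"
)


def fields_to_remove(text, alpha=0.05):
    """Return the names of fields whose p-value is greater than alpha."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["Field"] for row in reader if float(row["P-value"]) > alpha]


removal_candidates = fields_to_remove(PASTED)
print(removal_candidates)
```

The resulting list is the set of fields to set to None in the Type node, or to drop in a Filter node, in step 9.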

How it works...

In decision trees, the root (first or top) split identifies the variable that best separates the data into two or more subgroups that maximize a criterion of interest. A CHAID decision tree finds the single variable that has the largest chi-square statistic. However, to find this maximum, every variable must be examined and its chi-square statistic computed. The interactive mode reveals all of these values. The number of variables being tested doesn't affect this method significantly because computation for CHAID increases only linearly with the number of fields.
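
To make the root-split search concrete, here is a minimal Python sketch that computes a Pearson chi-square statistic for each candidate field against the target and ranks the fields by it. It is a simplification and not Modeler's implementation: the real CHAID algorithm also merges categories and adjusts the p-values, which is omitted here, and the toy data and field names are invented.

```python
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))  # cell counts of the contingency table
    x_totals = Counter(xs)           # row totals
    y_totals = Counter(ys)           # column totals
    stat = 0.0
    for x, x_n in x_totals.items():
        for y, y_n in y_totals.items():
            expected = x_n * y_n / n
            stat += (observed[(x, y)] - expected) ** 2 / expected
    return stat


# Invented toy data: a binary target and two hypothetical candidate fields.
target = [1, 1, 1, 0, 0, 0, 0, 0]
candidates = {
    "strong": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "weak": ["a", "b", "a", "b", "a", "b", "a", "b"],
}

# One statistic per field, so the work grows linearly with the number of fields.
scores = {name: chi_square(values, target) for name, values in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The field at the head of the ranking is the one a chi-square-based root split would favor; the rest of the ranking is what the interactive Predictors list exposes.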

Once we have the chi-square statistic values and corresponding p-values for every variable, we can use these values to select which variables are good predictors on their own (that is, produce significant differences in the target variable values after the split). This variable list can then be used as a simple variable selection method.

There's more...

One doesn't need to use the 0.05 value to select variables; many reasonable metrics can be used to select fields. For example, one can choose the top 10 or 25 variables regardless of p-value. Or one can relax the p-value selection criterion from 0.05 to 0.1 or 0.15 to allow more variables to be included in the analysis. If large numbers of rows exist in the data, the p-values may be very small even for splits that don't appear to be very useful. In these cases, the splits may be statistically significant but not operationally significant. Feel free to adjust the threshold of p-values to one that reflects the operational significance of your problem.
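
The effect of row count on p-values can be illustrated with a small Python sketch. It compares two invented 2x2 tables with identical response rates (52 percent versus 48 percent) that differ only in size, using the standard shortcut formula for the 2x2 chi-square statistic and the fact that, for one degree of freedom, the p-value equals erfc(sqrt(chi-square / 2)). The counts are made up for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Invented counts: the same 52% vs 48% split at two different sample sizes.
small = (52, 48, 48, 52)                  # 200 rows
large = tuple(1000 * v for v in small)    # 200,000 rows, same proportions

stat_small = chi_square_2x2(*small)
stat_large = chi_square_2x2(*large)
p_small = p_value_df1(stat_small)
p_large = p_value_df1(stat_large)

print(stat_small, p_small)
print(stat_large, p_large)
```

The statistic scales with the number of rows while the difference in response rates stays the same, so the larger table clears any conventional p-value threshold even though the split is no more useful operationally.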

As with the correlation matrix variable selection, selecting or removing a large number of variables may be tedious and prone to error, so writing a CLEM script to customize the Type node or Filter node can help.

Within the generated model, you have the option to create a Filter node that removes predictors or inputs that have not been used by the model, or you can remove fields based on predictor importance.

If you choose to generate a Filter node based on predictor importance, you then have additional options to include or exclude a certain number of fields, or to include or exclude fields based on a specified level of importance.

See also

  • The Selecting variables using the Means node and Selecting variables using single-antecedent Association Rules recipes in this chapter