Using CHAID stumps when interviewing an SME

In this recipe we will learn how to use the interactive mode of the CHAID Modeling node to explore data. The name stump comes from the idea that we grow just one branch and stop. The exploration will have the goal of answering five questions:

  1. What variables seem predictive of the target?
  2. Do the most predictive variables make sense?
  3. What questions are most useful to pose to the Subject Matter Experts (SMEs) about data quality?
  4. What is the potential value of the favorite variables of the SMEs?
  5. What missing data challenges are present in the data?

Getting ready

We will start with a blank stream.

How to do it...

To use CHAID stumps:

  1. Add a Source node to the stream for the cup98lrn reduced vars2.txt file. Ensure that the field delimiter is Tab and that the Strip lead and trail spaces option is set to Both.
  2. Add a Type node and declare TARGET_B as a flag and as the target. Set TARGET_D, RFA_2, RFA_2A, RFA_2F, and RFA_2R to None.
  3. Add a CHAID Modeling node and make sure that it is in interactive mode.
  4. Click on the Run button. From the menus choose Tree | Grow Branch with Custom Split. Then click on the Predictors button.
  5. Allow the top variable, LASTGIFT, to form a branch. Note that LASTGIFT does not seem to have missing values.
  6. Further down the list, the RAMNT_ series variables do have missing values. With the root node (Node 0) selected, choose Tree | Grow Branch with Custom Split again.
  7. The figure shows RAMNT_8, but your results may differ somewhat because CHAID takes an internal partition and therefore does not use all of the data. The slight differences can change the ranking of similar variables. Allow the branch to grow on your selected variable.
  8. Now we will break away the missing data into its own category. Repeat the steps leading up to this branch, but before clicking on the Grow button, select Custom and, at the bottom, set the Missing values into option to Separate Node.
  9. Sometimes SMEs will have a particular interest in a variable because it has been known to be valuable in the past, or because they are invested in the variable in some way. Even though it is well down the list, choose the variable Wealth2 and force it to branch, again ensuring that missing values are placed into a Separate Node.
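The effect of steps 7 and 8 can be mimicked outside Modeler. The following is a minimal Python sketch and not Modeler's algorithm: the cut points are supplied by hand, whereas CHAID chooses them by merging categories with chi-square tests. The `stump` function, the cut points, and the handful of records are made up for illustration; only the field names RAMNT_8 and TARGET_B come from the recipe.

```python
from bisect import bisect_left


def stump(records, predictor, target, cut_points):
    """Split records once on `predictor`; missing values (None) get their own node.

    cut_points must be a sorted, non-empty list of upper bin edges (hand-picked here).
    Returns a list of (label, n_records, n_positive, percent_positive) tuples.
    """
    labels = ["missing"]
    lower = None
    for cut in cut_points:
        labels.append(f"<= {cut}" if lower is None else f"({lower}, {cut}]")
        lower = cut
    labels.append(f"> {lower}")

    counts = {label: [0, 0] for label in labels}  # label -> [n, n_positive]
    for rec in records:
        value = rec[predictor]
        if value is None:
            label = "missing"
        else:
            label = labels[1 + bisect_left(cut_points, value)]
        counts[label][0] += 1
        counts[label][1] += rec[target]

    return [
        (label, n, pos, 100.0 * pos / n if n else 0.0)
        for label, (n, pos) in counts.items()
    ]


# Made-up records; None stands for a missing RAMNT_8 value.
records = [
    {"RAMNT_8": 5.0, "TARGET_B": 1},
    {"RAMNT_8": 9.0, "TARGET_B": 0},
    {"RAMNT_8": 12.0, "TARGET_B": 0},
    {"RAMNT_8": 15.0, "TARGET_B": 1},
    {"RAMNT_8": 25.0, "TARGET_B": 0},
    {"RAMNT_8": None, "TARGET_B": 0},
    {"RAMNT_8": None, "TARGET_B": 0},
    {"RAMNT_8": None, "TARGET_B": 1},
]

nodes = stump(records, "RAMNT_8", "TARGET_B", [9, 19])
for label, n, pos, pct in nodes:
    print(f"{label:>10}  n={n}  positives={pos}  rate={pct:.1f}%")
```

Keeping the missing values in their own node makes it easy to ask an SME whether "missing" carries meaning of its own, for example that no gift was made in that period.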

How it works...

There are several advantages to exploring data in this way with CHAID. If you have accidentally included perfect predictors, it will become obvious in a hurry; a separate recipe, listed in the See also section, is dedicated to this phenomenon. Another advantage is that most SMEs find CHAID rather intuitive: it is easy to see what the relationships are without extensive exposure to the technique. As an added benefit, the SMEs become acquainted with a technique that might also be used during the modeling phase.

As we have seen, CHAID can show missing data as a Separate Node. This feature is shown to be useful in the Binning scale variables to address missing data recipe in Chapter 3, Data Preparation – Clean. Staying in interactive mode keeps the trees simple, and it lets us force any variable to branch even if it is not near the top of the list. Often SMEs can be quite adamant that a variable is important, while the data shows otherwise. There are countless reasons why this might be the case, and the conversation should be allowed to unfold. One is likely to learn a great deal by trying to figure out why a variable that seemed promising is not performing well in the CHAID model.
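To see why a perfect predictor stands out so quickly, here is a hedged sketch that scores categorical fields against a flag target. It uses Cramér's V, derived from the Pearson chi-square statistic, as a simple stand-in; CHAID itself ranks predictors by adjusted p-values, so this is an illustration rather than Modeler's method. The fields `leak_flag` and `region` and all of the values are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (0 = no association, 1 = perfect)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    chi2 = 0.0
    for x, nx in x_totals.items():
        for y, ny in y_totals.items():
            expected = nx * ny / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(x_totals), len(y_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Made-up data: leak_flag mirrors the target exactly, region does not.
target = [1, 1, 0, 0, 0, 0, 1, 0]
fields = {
    "leak_flag": ["Y", "Y", "N", "N", "N", "N", "Y", "N"],
    "region": ["A", "B", "A", "B", "A", "B", "A", "B"],
}

scores = {name: cramers_v(values, target) for name, values in fields.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    note = "  <- possible perfect predictor, ask the SME" if score > 0.99 else ""
    print(f"{name:>10}  V={score:.3f}{note}")
```

A score at or very near 1 is a prompt for a question rather than a finding: the field may have been populated only after the outcome was known.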

Let's examine the CHAID tree a bit more closely. The root node shows the total sample size and the percentage in each of the two categories. In the figures in this recipe, the red group is the donors group. Notice that the smaller the LASTGIFT amount, the more likely the person was to donate: the rate starts at 8.286 percent for the less than or equal to 9 group and drops to 3.476 percent for the greater than 19 group. Note that the record counts in the child nodes add up to the count in the root node.
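The bookkeeping behind a stump can be checked by hand when copying node figures onto slides. The sketch below uses hypothetical counts, not the figures from this recipe, and the helper names `check_stump` and `donor_rate` are made up for illustration.

```python
def check_stump(root, children):
    """Return True if the child nodes account for every record and donor in the root.

    root and each child are (n_records, n_donors) pairs read off the tree.
    """
    total_records = sum(n for n, _ in children)
    total_donors = sum(d for _, d in children)
    return total_records == root[0] and total_donors == root[1]


def donor_rate(node):
    """Percentage of donors in a node."""
    n_records, n_donors = node
    return 100.0 * n_donors / n_records


# Hypothetical counts for a three-way split.
root = (1000, 50)
children = [(200, 16), (500, 24), (300, 10)]

print("children consistent with root:", check_stump(root, children))
print("root rate:", f"{donor_rate(root):.3f}%")
for child in children:
    print("child rate:", f"{donor_rate(child):.3f}%")
```

If the check fails, the usual causes are a transcription error or a node, such as the missing-values node, that was left off the slide.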

It is recommended that you take screenshots of the branches for at least the top 10 or so variables of interest to management or SMEs. It is a good precaution to place the images on slides, since you will be able to review and discuss them without waiting for Modeler to process. Having said that, it is an excellent idea to be ready to explore the data further using this technique on live data during the meeting.

See also

  • The Using the Feature Selection node creatively to remove or decapitate perfect predictors recipe in Chapter 2, Data Preparation – Select
  • The Binning scale variables to address missing data recipe in Chapter 3, Data Preparation – Clean