Multivariate adaptive regression splines

In the prior chapter, we discussed MARS in detail: how it works, why you would use it, and so on, so I won't duplicate that here, other than to note that it can be applied to a classification problem as a generalized linear model. One of its key benefits is built-in feature selection, so there's no need to run stepwise selection, Information Value (IV), or even regularization, for that matter.

We'll train it with 5-fold cross-validation and set nprune = 15 to cap the number of terms retained in the pruned model at 15. Recall from the previous chapter that a single feature can contribute more than one term because the algorithm fits piecewise hinge functions.

This code will give us our model object. Be advised that this may take some time to complete:

> set.seed(1972)

> earth_fit <-
    earth::earth(
      x = train[, -142],
      y = train[, 142],
      pmethod = 'cv',
      nfold = 5,
      degree = 1,
      minspan = -1,
      nprune = 15,
      glm = list(family = binomial)
    )

Here's the model summary:

> summary(earth_fit)
Call: earth(x=train[,-142], y=train[,142], pmethod="cv",
glm=list(family=binomial), degree=1, nprune=15, nfold=5,
minspan=-1)

GLM coefficients
                       y
(Intercept)      -3.4407
V23              -6.6750
V24              -1.3539
V105             -0.8200
h(28-V2)         -0.4769
h(V2-28)          0.0071
h(1-V31)          1.4876
h(V31-1)          0.0947
h(106449-V141)    0.0000
h(V141-106449)    0.0000

Earth selected 10 of 10 terms, and 6 of 141 predictors using pmethod="cv"

As you can see in the summary, the model ended up with six predictive features and ten terms in total, including paired hinge functions on V2, V31, and V141. By convention, a pair of hinge terms reads as follows: the first term applies when the predictor is below the knot value, and the second when it is at or above it. For instance, for V31, a value less than 1 carries a coefficient of 1.4876; otherwise, the coefficient is 0.0947.
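To make the hinge notation concrete, here is a small sketch of how a term such as h(1-V31) is evaluated; the h() helper and the example values are made up purely for illustration:

> # a hinge term is max(0, expression); pmax() vectorizes this in R
> h <- function(x) pmax(0, x)
> V31 <- c(0, 0.5, 1, 2, 5)    # hypothetical values of V31
> h(1 - V31)                   # contributes only when V31 < 1
[1] 1.0 0.5 0.0 0.0 0.0
> h(V31 - 1)                   # contributes only when V31 > 1
[1] 0 0 0 1 4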

We can plot each selected feature's relationship with the predicted probability. Setting ylim to NA gives each panel its own vertical axis range, which makes changes in y (predicted probability) easier to see as the feature values change:

> plotmo::plotmo(earth_fit, ylim = NA)

The output of the preceding code is as follows:

Notice how, for V31, the slope changes at the knot value of 1, with one coefficient below the knot and another above it, as described previously. Feature importance is easy to produce:

> earth::evimp(earth_fit)
     nsubsets   gcv   rss
V31         9 100.0 100.0
V2          8  74.8  75.0
V141        7  40.2  41.1
V105        6  31.5  32.5
V23         5  26.8  27.8
V24         4  20.7  21.8

The nsubsets criterion counts the number of model subsets that include the feature. Features that are included in more subsets are considered more important. These subsets are the candidate models generated by earth's backward pruning pass.

Although we specified cross-validation, earth still runs its forward pass to add terms and a backward pruning pass to remove them, guided by Generalized Cross-Validation (GCV) as discussed in the previous chapter. The gcv and rss values for each feature are normalized to a 0 to 100 scale so that the features can be compared directly.
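If you prefer a visual summary of these rankings, the object returned by evimp() has a plot method; a minimal sketch:

> # plot the importance object: features ordered by nsubsets, gcv and rss overlaid
> ev <- earth::evimp(earth_fit)
> plot(ev)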

As we did previously, a plot of the predicted probability densities by class is useful, and earth comes with its own plotd() function.
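A minimal call, leaving the function's arguments at their defaults, looks like this:

> # plot predicted-probability densities for each class of the response
> earth::plotd(earth_fit)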

I like how the predicted values are reversed compared to the prior plots. Other than that, it's hard to discern anything meaningful, except that the densities are quite similar.
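The optimalCutoff() call that follows needs a vector of training-set predictions, referenced below as pred. If it hasn't been created yet, here is a minimal sketch, assuming (as in the fitting call) that column 142 of train holds the response:

> # predicted probabilities on the training data
> pred <- predict(earth_fit, newdata = train[, -142], type = 'response')

With the predictions in hand, let's get the cutoff value: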

> mars_cutoff <-
    InformationValue::optimalCutoff(
      train$y,
      pred,
      optimiseFor = 'Both',
      returnDiagnostics = TRUE
    )

Examination of the object provides the following:

  • Optimal cutoff = 0.04976
  • TPR = 0.6449
  • FPR = 0.208
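To inspect the diagnostics yourself rather than taking the bullet points above at face value, you can look at the structure of the returned list (a minimal check; no particular element names are assumed):

> # the diagnostics come back as a list; str() shows every element and its value
> str(mars_cutoff)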

Compared with logistic regression, we have a higher true positive rate at the expense of a slightly higher false positive rate.
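To see those rates in a more familiar form, a quick cross-tabulation of the training labels against the thresholded predictions works; a minimal sketch using base R, the pred vector from above, and the cutoff value reported above:

> # classify as 1 when the predicted probability exceeds the optimal cutoff
> table(actual = train$y, predicted = as.numeric(pred > 0.04976))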

Let's move on to evaluating performance on the test set:

> test_pred <- predict(earth_fit, test, type = 'response')

> Metrics::auc(test$y, test_pred)
[1] 0.8079

> Metrics::logLoss(test$y, test_pred)
[1] 0.1406

What do we see here? A slight improvement in AUC, and a lower (better) log-loss. While not dramatic, it may be of value. We can now turn to visually comparing the two models to confirm that MARS is indeed the preferred algorithm.