Decision trees in R; cptable

jerrytuttle · March 13, 2023, 2:24pm

Hi. Would someone please explain how to derive the numbers in the decision tree cptable, starting with the cp value? Here is a simple dataset to use. Thank you.

library(rpart)
df ← data.frame(x=c(1, 2, 3, 3, 3), y=c(0, 0, 1, 0, 1))
model<-rpart(y ~ x, data = df, method=“class”, minbucket = 1, minsplit=1, xval=5)
summary(model)

now_samantha · March 14, 2023, 2:47pm

I can only give you a partial answer.

With your data set, there are two possible trees that we can make: either the trivial tree with 0 splits, or a tree with 1 split based on whether or not x < 2.5.

For both possible trees, we can compute the error rate using a variety of possible methodologies. I’m not sure which one rpart is using. The relative error column is the ratio of those, that is for each possible tree, we are finding (error rate)/(error rate for trivial tree), hence the `relative’ in the title. The relative error rate will start at 1 for the trivial tree and be monotonically decreasing as the number of splits increases.

CP stands for complexity parameter. When generating a tree, we want to know how far down we go before stopping. We will continue to grow our tree as long as growing it causes the relative error to decrease by more than the CP, and stop if the decrease is <= the CP. As the relative error decreases by 0.5 from the trivial tree to the full tree, here we will stick with the trivial tree if CP >= 0.5, and go to the full tree if CP < 0.5. The 0.5 in the first row indicates that threshold. I think the 0.01 in the 2nd row is the default for rpart, and as there are no other trees with different predictions beyond 1 split it is choosing to display that rather than 0.

The xerror column stands for the cross-validation error. Google suggests it wants to use 10-fold cross validation. Given the size of your data set, I assume it is using 5 fold cross validation.

I don’t know how the xstd column is generated.

jraven · March 15, 2023, 6:49am

Based on some other additional examples it looks like the error being used (at least with those parameters to rpart) is probably the number of misclassifications. In this case, the trivial tree (which assigns 0 to everything) misclassifies 2 of the inputs; the maximal tree misclassifies 1 of the inputs (one of the x=3 cases). So the relative error for the maximal tree is 1/2.

jerrytuttle · March 15, 2023, 3:43pm

Hi. Thanks to both of you. Maybe I picked too trivial an example, so let me change the example. My goal is to be able to derive the cptable so I can understand it. I can get an xerror and xstd for the best cp, but I don’t know how to get them for other cp’s, and I don’t know how to calculate cp’s from xerrors. There is also a scaling somehow. Thanks.

this is a scrubbed version of titanic, n= 891

library(rpart)
library(rpart.plot)
library(caret)
df ← read.csv(“https://static-resources.zybooks.com/static/titanic.csv”)
colnames(df)[1] ← “survived”
df ← subset(df, select=-c(embarked, embark_town, class, deck, alive))
df$age[is.na(df$age)] ← round(mean(df$age, na.rm = TRUE),0)
rpmodel ← rpart(survived ~ ., method=“class”, data=df, minbucket = 1, minsplit=1, xval=5)
summary(rpmodel) # includes cptable
rpart.plot(rpmodel, extra=104)
plotcp(rpmodel)
k ← 10
df$survived ← factor(df$survived)
train_control ← trainControl(method = “cv”, number = k)
cvmodel ← train(survived ~ ., data = df, method = “rpart”, trControl = train_control)
print(cvmodel)
mean(cvmodel$resample$Accuracy)
sd(cvmodel$resample$Accuracy)

now_samantha · March 15, 2023, 4:37pm

You can have rpart generate the trees based on whatever cp value you want, so let’s use your Titanic data set with cp = 0.001.

> model4 <- rpart(survived ~ ., method="class", data=df, 
minbucket=1, minsplit=1, xval=5, cp=0.001)

> summary(model4)
            CP nsplit  rel error    xerror       xstd
1  0.450292398      0 1.00000000 1.0000000 0.04244576
2  0.052631579      1 0.54970760 0.5643275 0.03595352
3  0.010721248      3 0.44444444 0.4883041 0.03406141
4  0.008771930      6 0.41228070 0.4415205 0.03274458
5  0.005847953      9 0.38596491 0.4415205 0.03274458
6  0.002923977     25 0.28362573 0.4327485 0.03248331
7  0.002088555     42 0.23391813 0.4385965 0.03265801
8  0.001949318     49 0.21929825 0.4532164 0.03308566
9  0.001461988     64 0.18713450 0.5204678 0.03489714
10 0.001000000    139 0.07309942 0.5584795 0.03581795

For each row, CP = (decrease in rel error) / (increase in n split). E.g., 0.45029 = 1 - 0.54971, 0.05263 = (0.54971 - 0.44444)/(3-1), etc. Why? When doing complexity pruning with a parameter $\alpha$, we pick the tree that minimizes (error measurement) + $\alpha |T |$. If $\alpha > 0.45029$ then that is minimized when $|T|=0$ and our rel. error is 1, when $0.45029 > \alpha > 0.05263$ it is minimized when $|T|=1$, when $0.05263 > \alpha > 0.01072$ it is minimized when $|T|=3$, etc. Because I set CP = 0.001 in the rpart call, our table continues until CP = 0.001.

You cannot calculate cp from the xerror values. The point of the xerror, i.e. cross-validation error values, is to pick which tree is `best’. The fact that xerror is increasing in my last few rows suggest that we are over fitting when nsplit > 42. One rule of thumb that I have seen (I assume there are others) is to pick the smallest tree such that xerror < (smallest possible xerror + xstd from tree with smallest possible xerror). Here, that means pick the smallest tree with xerror < 0.43275 + 0.03248 = 0.47, so we want nsplit = 6.

jerrytuttle · March 15, 2023, 10:35pm

Now_samantha, this was quite helpful. Thank you.

jraven · March 16, 2023, 3:54am

The xval parameter in the rpart call is the number of cross-validations to use – so 5 in this case. That means the code will break your data into 5 disjoint subsets (“folds”) and do some analysis for each (see below). Because you don’t know how the data was split into pieces you generally won’t be able to reproduce those results yourself, though you would be able to for it your initial 5-point data set (since when you have 5 data points there’s only one way to pick the 5 folds).

The process for getting the cross-validation results (xerror, xstd) for a given CP should basically be:

For each fold, the training set will be all the data points not in the fold. The fold itself will be the validation dataset.
Build a model using the training set
Choose the tree from this model with the given CP (the row with the biggest CP less than or equal to the given CP)
Calculate the error of that tree on the training set, which in this case is probably the number of misclassifications divided by the size of the training set

After performing that for each of the folds, then calculate the average and sample standard deviation of the errors you got for the folds. Up to some scaling (which should be the same across the different CPs) that should give you the xerror and xstd values in the CP table.

jerrytuttle · March 16, 2023, 6:09pm

Thanks jraven. I think my original 5 point dataset was too small. Let’s continue with this Titanic dataset.

Cross-validation is helpfully choosing the optimal cp. For that cp, I can find the error for each fold, and the mean of them agrees with the cptable for that cp. What if I want the cross-validation errors for a specific cp. Somewhere within

train_control ← trainControl(method = “cv”, number = k)
cvmodel ← train(survived ~ ., data = df, method = “rpart”, trControl = train_control)

I should be able to insert a specific cp value, but the train_control line does not permit it. Any ideas?