I’m using XGBoost to do a prediction task with R package released on Github. (2.0.3 Patch Release on Github)
packageVersion("xgboost")
[1] ‘2.0.3.1’
The following example using the iris dataset demonstrates my issue. I predict Sepal.Length (continuous) from the other variables. All of the predictors are numeric except for Species (categorical), which has 3 levels. I first convert it to the integers 1, 2, 3 and then use setinfo("feature_type") to tell xgboost that it is a categorical predictor.
library(xgboost)
data(iris)
y <- iris$Sepal.Length
x <- iris[, -1]
x$Species <- as.integer(x$Species)
x <- as.matrix(x)
dm <- xgb.DMatrix(data = x, label = y)
setinfo(dm, "feature_type", c("q", "q", "q", "c"))  # 'q' = numeric, 'c' = categorical
model <- xgb.train(
  data = dm,
  nrounds = 10
)
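For completeness, here is the quick check I ran to confirm training really succeeded (a minimal sketch that assumes the model and dm objects from the snippet above; nothing here touches the categorical handling):

```r
# In-sample predictions for Sepal.Length from the fitted model.
preds <- predict(model, dm)
stopifnot(length(preds) == nrow(iris))  # one prediction per row
head(preds)
```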
Everything goes well. However, when I save the model as plain text with xgb.dump(), it raises an error.
xgb.dump(model, 'model.dump')
Error: Check failed: is_numerical: f3 in feature map is categorical but tree node is numerical.
According to the message, f3 is the Species column, and it is correctly regarded as categorical in the feature map. But why does it say "tree node is numerical"? Does that mean setinfo("feature_type") has no effect on the boosted trees? Is Species still treated as numerical under the hood?
P.S. I don't want to convert Species into dummy variables, because I want to compute variable importance afterwards. With dummy coding I would get a separate importance value for each dummy column of Species, but I want a single importance value for this predictor.