I am using xgboost version 2.1.0.
When I convert a pandas DataFrame containing category columns to a DMatrix using xgboost.DMatrix() with enable_categorical=True, everything behaves as expected unless the DataFrame is one returned by sklearn's train_test_split(), even though the dtypes of all columns are still category.
The following code produces the expected behavior:
import pandas as pd
import xgboost as xgb
import seaborn as sns
tips = sns.load_dataset('tips')
X, y = tips.drop('tip', axis=1), tips['tip']
print(X.dtypes)
# convert to DMatrix
dm = xgb.DMatrix(X, y, enable_categorical=True)
dm
---------------------------
total_bill float64
sex category
smoker category
day category
time category
size int64
dtype: object
<xgboost.core.DMatrix at 0x23153d7cf10>
The following code raises an error:
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=1)
print(X_train.dtypes)
dtrain = xgb.DMatrix(X_train, y_train, enable_categorical=True)
-----------------------------
total_bill float64
sex category
smoker category
day category
time category
size int64
dtype: object
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter`enable_categorical` must be set to `True`. Invalid columns:sex: category, smoker: category, day: category, time: category
I am confused.
Thank you.
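A likely cause, judging only from the code shown above: train_test_split() returns its outputs in the order X_train, X_test, y_train, y_test, but the failing snippet unpacks them as X_train, y_train, X_test, y_test. If so, the variable named y_train actually holds the test features (a DataFrame with category columns), and that DataFrame is what gets passed as the label. The sketch below uses a small made-up stand-in for the tips data (the values are hypothetical) and only pandas and scikit-learn to show what each unpacking order puts into y_train.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in for the tips data (hypothetical values), with one category column.
X = pd.DataFrame({
    "total_bill": [10.0, 20.5, 15.2, 30.1, 8.4, 12.9, 22.0, 18.3],
    "sex": pd.Categorical(["F", "M", "M", "F", "M", "F", "M", "F"]),
})
y = pd.Series([1.0, 3.0, 2.0, 5.0, 1.5, 2.0, 3.5, 2.5], name="tip")

# Order used in the question: the second name receives X_test, not the labels.
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=1)
wrong_label_type = type(y_train).__name__

# Documented order: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
right_label_type = type(y_train).__name__

print(wrong_label_type, right_label_type)  # DataFrame Series
```

With the second unpacking order, y_train is a numeric Series of tips, so the same xgb.DMatrix(X_train, y_train, enable_categorical=True) call from the question should no longer be handed category columns as the label. I have not verified this against xgboost 2.1.0 specifically, so treat it as a suggestion to check first.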