Since XGBoost 2.0, base_score is automatically calculated if it is not specified when initialising an estimator. I naively thought it would simply use the mean of the target, but that does not seem to be the case:
import json

import shap  # only for the dataset
import xgboost as xgb

print('shap.__version__:', shap.__version__)
print('xgb.__version__:', xgb.__version__)
print()

X, y = shap.datasets.adult()

estimator = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=200,
)
estimator.fit(X, y)

config = json.loads(estimator.get_booster().save_config())
base_score = float(config['learner']['learner_model_param']['base_score'])

print('y.mean():', y.mean())
print('base_score:', base_score)
Output:
shap.__version__: 0.46.0
xgb.__version__: 2.1.0

y.mean(): 0.2408095574460244
base_score: 0.26177529
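For comparison, base_score is also an ordinary constructor argument of the sklearn wrapper, so passing it explicitly should bypass the automatic estimate entirely. A minimal sketch, assuming the explicitly passed value is simply echoed back in the saved config (modulo float32 rounding):

import json

import shap  # only for the dataset
import xgboost as xgb

X, y = shap.datasets.adult()

# Pin the intercept to the target mean; with base_score given explicitly,
# the automatic estimation should be skipped.
estimator = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=200,
    base_score=y.mean(),
)
estimator.fit(X, y)

config = json.loads(estimator.get_booster().save_config())
print(config['learner']['learner_model_param']['base_score'])  # expected: ~0.2408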
The difference between y.mean() and the automatically chosen base_score is far too large to be a rounding error. So how is base_score actually calculated? I think this is the relevant commit, but it's hard to tell from it exactly how the value is computed.
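For reference, a quick check on the two numbers printed above (pure arithmetic, no retraining) shows the gap is about 0.021, and another naive guess, that the config stores the sigmoid of the mean, does not reproduce the value either:

import math

y_mean = 0.2408095574460244   # printed above
base_score = 0.26177529       # from the saved config

# The gap is ~0.021, i.e. roughly two percentage points: far beyond any
# float32/float64 rounding effect.
print(abs(base_score - y_mean))

# If the config stored sigmoid(y.mean()) instead of y.mean() itself, the value
# would be ~0.56, so that does not explain 0.26177529 either.
print(1 / (1 + math.exp(-y_mean)))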