I’m trying to build a preprocessing pipeline for my XGBoost model. The data contains NaNs and needs to be scaled. This is the relevant code:
import xgboost
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

xgb_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', preprocessing.StandardScaler()),
    ('regressor', xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective='reg:squarederror'))])

xgb_pipe.fit(train_x.values, train_y.values,
             regressor__early_stopping_rounds=20,
             regressor__eval_metric='rmse',
             regressor__eval_set=[(train_x.values, train_y.values), (test_x.values, test_y.values)])
The loss immediately increases and the training stops after 20 iterations.
If I remove the imputer and the scaler from the pipeline, it works and trains for the full 100 iterations. If I manually preprocess the data instead (roughly as sketched below), it also works as intended, so I know the problem is not the data.
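For reference, the manual version that does train for the full 100 rounds looks roughly like this (a sketch of what I do by hand; same train_x/test_x data and the same imputer/scaler settings as above):

# Manual preprocessing (sketch): fit the imputer and scaler on the training data,
# apply the same fitted transforms to the eval data, then fit the bare regressor.
imputer = SimpleImputer(strategy='mean')
scaler = preprocessing.StandardScaler()

train_x_prep = scaler.fit_transform(imputer.fit_transform(train_x.values))
test_x_prep = scaler.transform(imputer.transform(test_x.values))

model = xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective='reg:squarederror')
model.fit(train_x_prep, train_y.values,
          early_stopping_rounds=20,
          eval_metric='rmse',
          eval_set=[(train_x_prep, train_y.values), (test_x_prep, test_y.values)])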
What am I missing?