I am trying to find the minimum number of features to include in my XGBoost regression model, to avoid overfitting. My approach is to fit the model to my training data (X_train), predict on that same data, and calculate AIC (as shown in the Python code below). I then drop one feature from the X_train pandas DataFrame, refit, predict, and calculate AIC again. What I'm noticing is that the AIC actually goes up as I drop features from X_train. My questions are: is my approach correct? Is my AIC calculation correct? And if the answer to both is yes, what might explain the AIC increasing as I drop features from the data the model is trained on? The code I'm using to calculate AIC is below, followed by a rough sketch of the drop-one-feature loop I'm running.
code:
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import numpy as np
# Assuming X_train and y_train are pandas DataFrames
# Convert them to numpy arrays, as XGBoost works with numpy arrays directly
X_train_np = X_train.values
y_train_np = y_train.values.flatten() # Assuming y_train is a single column DataFrame, convert it to a 1D array
# Fit the XGBoost regression model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror')
xg_reg.fit(X_train_np, y_train_np)
# Predict on the training data
y_pred = xg_reg.predict(X_train_np)
# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_train_np, y_pred)
# Number of parameters in the model: one per boosted tree (get_dump returns one dump per tree), plus 1 for the intercept
num_params = len(xg_reg.get_booster().get_dump()) + 1
# Calculate the Akaike Information Criterion (AIC)
n = len(y_train_np)
aic = n * np.log(mse) + 2 * num_params
print("AIC:", aic)