I am new to XGBoost, so I may have made some basic mistakes. Any help is appreciated, thank you!
Briefly, I want to do a regression task with XGBoost on data spread across several CSV files. I concatenated them into one DataFrame and split it into train/val/test sets with train_test_split. The model worked well (MAE: 0.6). But when I split the training and test sets manually (I picked some of the CSV files and put them in a test folder), the results became very poor (MAE: 12+).
I’m really wondering what happened here? I’ve posted some of the code below.
1: Here is the split code with train_test_split:
```python
import os
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
# ThermalDataset, scaler and device are defined elsewhere (not shown)

# load the data
datasets = []
path = '../data/low_fidelity_chips_res'
for filename in os.listdir(path):
    if filename.endswith(".csv"):
        dataset = ThermalDataset(os.path.join(path, filename))
        datasets.append(dataset)
# combine datasets
merged_dataset = pd.concat([pd.DataFrame(dataset.X) for dataset in datasets])
merged_targets = pd.concat([pd.DataFrame(dataset.y) for dataset in datasets])
X_scaled = scaler.fit_transform(merged_dataset)
y_scaled = scaler.fit_transform(merged_targets)
# split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.2, random_state=11)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=11)
# build xgboost model
model = xgb.XGBRegressor(tree_method='gpu_hist', gpu_id=device.index, n_estimators=500, learning_rate=0.05, max_depth=8)
model.fit(X_train, y_train)
# evaluation
y_val_pred_scaled = model.predict(X_val)
y_test_pred_scaled = model.predict(X_test)
# inverse_transform to get original data
y_val_pred = scaler.inverse_transform(y_val_pred_scaled.reshape(-1, 1)).flatten()
y_test_pred = scaler.inverse_transform(y_test_pred_scaled.reshape(-1, 1)).flatten()
y_val_original = scaler.inverse_transform(y_val.reshape(-1, 1)).flatten()
y_test_original = scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()
val_mse = mean_squared_error(y_val_original, y_val_pred)
test_mse = mean_squared_error(y_test_original, y_test_pred)
val_mae = mean_absolute_error(y_val_original, y_val_pred)
test_mae = mean_absolute_error(y_test_original, y_test_pred)
```
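One detail in the code above that may be worth checking: the same `scaler` object is fitted on the features and then again on the targets. In scikit-learn, every `fit_transform` call replaces the previously learned statistics, so a later `inverse_transform` uses whatever was fitted last. The following is a minimal sketch of that behaviour, assuming `scaler` is a `StandardScaler` (the post does not show its definition) and using made-up arrays in place of the real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumption: `scaler` in the post is one shared StandardScaler instance.
scaler = StandardScaler()

# Hypothetical stand-ins for one feature column and the target column.
X = np.array([[0.0], [10.0], [20.0]])
y = np.array([[100.0], [200.0], [300.0]])

scaler.fit_transform(X)
mean_after_X = scaler.mean_.copy()   # statistics learned from X

y_scaled = scaler.fit_transform(y)   # refit: the X statistics are discarded
mean_after_y = scaler.mean_.copy()   # statistics now describe y only

# inverse_transform uses the most recent fit (here: y).
y_back = scaler.inverse_transform(y_scaled)

print("mean after fitting X:", mean_after_X)
print("mean after fitting y:", mean_after_y)
```

In the snippet above the last fit is on `merged_targets`, so inverting the scaled predictions happens to use the target statistics.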
2: Here is the code after my manual split:
```python
# training dataset
datasets = []
path = '../data/low_fidelity_chips_res'
for filename in os.listdir(path):
    if filename.endswith(".csv"):
        dataset = ThermalDataset(os.path.join(path, filename))
        datasets.append(dataset)
# testing data: a csv that I manually moved from the original data into the test folder
tests = []
test_file = '../data/test_xgboost/Thermal014withMidPos.csv'
test = ThermalDataset(test_file)
tests.append(test)
# combine
merged_dataset = pd.concat([pd.DataFrame(dataset.X) for dataset in datasets])
merged_targets = pd.concat([pd.DataFrame(dataset.y) for dataset in datasets])
test_x = pd.concat([pd.DataFrame(test.X) for test in tests])
test_y = pd.concat([pd.DataFrame(test.y) for test in tests])
# normalization
X_scaled = scaler.fit_transform(merged_dataset)
y_scaled = scaler.fit_transform(merged_targets)
x_fill = scaler.fit_transform(test_x)
y_fill = scaler.fit_transform(test_y)
# split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.05, random_state=11)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=11)
# train
model = xgb.XGBRegressor(tree_method='gpu_hist', gpu_id=device.index, n_estimators=500, learning_rate=0.05, max_depth=8)
model.fit(X_train, y_train)
# evaluation
y_val_pred_scaled = model.predict(X_val)
y_test_pred_scaled = model.predict(X_test)
y_fill_res = model.predict(x_fill)
# inverse_transform to get original data
y_val_pred = scaler.inverse_transform(y_val_pred_scaled.reshape(-1, 1)).flatten()
y_test_pred = scaler.inverse_transform(y_test_pred_scaled.reshape(-1, 1)).flatten()
y_pre = scaler.inverse_transform(y_fill_res.reshape(-1, 1)).flatten()
y_val_original = scaler.inverse_transform(y_val.reshape(-1, 1)).flatten()
y_test_original = scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()
y_fill = scaler.inverse_transform(y_fill.reshape(-1, 1)).flatten()
val_mse = mean_squared_error(y_val_original, y_val_pred)
test_mse = mean_squared_error(y_test_original, y_test_pred)
val_mae = mean_absolute_error(y_val_original, y_val_pred)
# MAE on the manually held-out csv
test_mae = mean_absolute_error(y_pre, y_fill)
```
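In this second version the held-out file is scaled with statistics fitted on the held-out file itself (`scaler.fit_transform(test_x)`), and the final `inverse_transform` calls use the statistics of `test_y`, since that was the last fit. The sketch below compares that pattern with one possible alternative: separate scalers for features and targets, fitted on the training data only and then reused on the held-out file. It is only an illustration under several assumptions, not a diagnosis: the scaler is assumed to be a `StandardScaler`, a scikit-learn `DecisionTreeRegressor` stands in for `XGBRegressor`, the data are synthetic, and the names `x_scaler`, `y_scaler` and `X_file` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: training rows and one held-out "file" covering a narrower range.
X_train = rng.uniform(0.0, 10.0, size=(1000, 1))
y_train = X_train[:, 0] ** 2
X_file = rng.uniform(6.0, 9.0, size=(100, 1))
y_file = X_file[:, 0] ** 2

# Separate scalers, fitted on the training data only.
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))

# Stand-in tree model (the post uses xgb.XGBRegressor).
model = DecisionTreeRegressor(random_state=0)
model.fit(x_scaler.transform(X_train),
          y_scaler.transform(y_train.reshape(-1, 1)).ravel())

# Consistent scaling: transform the held-out file with the training statistics.
pred_scaled = model.predict(x_scaler.transform(X_file))
pred = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
mae_consistent = mean_absolute_error(y_file, pred)

# Pattern from the post: refit the scaling on the held-out file itself.
refit_x = StandardScaler().fit_transform(X_file)
refit_y_scaler = StandardScaler().fit(y_file.reshape(-1, 1))
pred_refit = refit_y_scaler.inverse_transform(
    model.predict(refit_x).reshape(-1, 1)).ravel()
mae_refit = mean_absolute_error(y_file, pred_refit)

print(f"MAE with training statistics: {mae_consistent:.3f}")
print(f"MAE with refitted statistics: {mae_refit:.3f}")
```

On this toy data the refitted variant gives the larger error, because the model receives inputs on a different scale from the one it was trained on. Whether the same effect explains the gap on the real data would need to be checked there.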
I expect to be able to run inference on a single CSV file and get results as good as the ones I got with the random split.
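A random row-level `train_test_split` puts rows from every CSV file into both the training and the test set, whereas the manual split tests on a file the model has never seen, so the two scores may not be directly comparable. One way to get a validation score that is closer to the single-file setting is to split by file instead of by row. A minimal sketch, assuming each row can be tagged with the index of its source file; the `groups` array and the synthetic data are hypothetical and only `GroupShuffleSplit` comes from scikit-learn:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-in: 5 files with 20 rows each.
n_files, rows_per_file = 5, 20
groups = np.repeat(np.arange(n_files), rows_per_file)  # source-file id per row

rng = np.random.default_rng(0)
X = rng.normal(size=(n_files * rows_per_file, 3))
y = X.sum(axis=1)

# Hold out whole files: no file contributes rows to both sides of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=11)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

train_files = set(groups[train_idx].tolist())
test_files = set(groups[test_idx].tolist())
print("train files:", sorted(train_files))
print("test files:", sorted(test_files))
```

With the real data, `groups` could be built while loading by repeating each file's index once per row of that file.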