I'm following some tutorials on data engineering and feature engineering using the Boston housing dataset, and here is an example where I try different imputation strategies with cross-validation to identify which one delivers the best model performance. To my surprise, all of them delivered exactly the same results. How is this possible? Here is what I have tried:
from enum import Enum
from numpy import mean, std
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline

class ImputeStrategy(Enum):
    MEAN = 'mean'
    MEDIAN = 'median'
    MOST_FREQUENT = 'most_frequent'
    CONSTANT = 'constant'

def evaluate_imputation_strategies(X, y):
    # Evaluate each imputation strategy on the dataset
    results = list()
    for s in ImputeStrategy:
        # Create the modeling pipeline: impute first, then fit a linear model
        pipeline = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy=s.value)),
            ('model', LinearRegression())
        ])
        # Evaluate the pipeline with repeated 10-fold cross-validation
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(pipeline, X, y, scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
        # Convert negated scores to positive MSE values
        scores = -scores
        # Store results
        results.append(scores)
        print('>%s %.3f (%.3f)' % (s, mean(scores), std(scores)))
    return results
Here is how I call it:
# Split into input and output elements dynamically
X = boston.iloc[:, :-1]  # Select all columns except the last one as features
y = boston.iloc[:, -1]   # Select only the last column as the target
# Run the evaluation
results = evaluate_imputation_strategies(X, y)
Here is what I see printed:
>ImputeStrategy.MEAN 23.849 (9.459)
>ImputeStrategy.MEDIAN 23.849 (9.459)
>ImputeStrategy.MOST_FREQUENT 23.849 (9.459)
>ImputeStrategy.CONSTANT 23.849 (9.459)
Is the result correct?
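In case it is relevant: my understanding is that SimpleImputer only changes cells that are actually NaN, so I wonder whether the dataset contains any missing values at all. A minimal check (a sketch, assuming X is a pandas DataFrame as above) would be something like:

# Count the missing values in each feature column (assumes X is a pandas DataFrame)
print(X.isnull().sum())
# Total number of missing cells across the whole feature matrix
print(X.isnull().sum().sum())

If that total is zero, I suppose all four imputers would have nothing to fill in and every pipeline would see identical data, but I would like to confirm whether that is the explanation here.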