I’m looking for the best practice to avoid data leakage. I have one feature that requires mode imputation, and the model is an XGBoost classifier.
These are the steps that I planned:
- Split the data into a random 80% training set and 20% test set
- Apply mode imputation to the training and test sets independently
- Run RandomizedSearchCV on the training set for hyperparameter tuning
- Train the model on the whole training set using the best parameter set found
- Evaluate the model on the unseen test set
Now, my doubt is: is it okay to perform the mode imputation on the whole training set and then run RandomizedSearchCV? I thought data leakage only applied when evaluating on the unseen test set.
Or should I perform the imputation inside each fold of RandomizedSearchCV to avoid data leakage? If so, how can I do it? I looked at scikit-learn pipelines, but I cannot figure out how to apply the mode imputation only to the specific feature I need.
Thank you in advance!
P.S. Feedback on all the steps is welcome too if anything is wrong!