I have a dataframe with approximately 5000 datapoints and I need to create bins for cross-validation. Additionally, I have a categorical metadata variable with around 1000 unique values. To prevent data leakage, I want to ensure that datapoints sharing the same metadata value are not split across different bins. The bins need to be approximately the same size, and preferably I want to have 5 bins.
I’ve searched for functions to achieve this, but they all seem to cater to numerical variables (e.g., `pd.qcut`). Since this metadata variable is not used as a predictor in model training, I believe the `optbinning` package is also not suitable. Does anyone know of a method or package that can help with this?
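To make the requirement concrete, here is a minimal sketch of the kind of assignment I am after, written in plain Python. It is a simple greedy heuristic (largest groups first, each placed into the currently smallest bin), not an optimal partition. The function name `assign_group_bins` and the toy `metadata` list are made up for illustration.

```python
import heapq
from collections import Counter


def assign_group_bins(groups, n_bins=5):
    """Map each group label to one bin so that no group is split
    across bins and the bin sizes stay roughly balanced (greedy heuristic)."""
    counts = Counter(groups)
    # Min-heap of (current_size, bin_index): the smallest bin is always on top.
    heap = [(0, b) for b in range(n_bins)]
    heapq.heapify(heap)
    group_to_bin = {}
    # Place the largest groups first; each one goes into the currently smallest bin.
    for label, size in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        bin_size, bin_idx = heapq.heappop(heap)
        group_to_bin[label] = bin_idx
        heapq.heappush(heap, (bin_size + size, bin_idx))
    return group_to_bin


# Toy data standing in for the metadata column (hypothetical values).
metadata = ["a", "a", "a", "b", "b", "c", "d", "d", "e", "f"]
mapping = assign_group_bins(metadata, n_bins=3)
bins = [mapping[m] for m in metadata]  # one bin index per datapoint
```

With a dataframe, the same mapping could presumably be applied with something like `df["bin"] = df["meta"].map(mapping)`, where `meta` and `bin` are placeholder column names. scikit-learn's `GroupKFold` (passing the metadata column as `groups`) looks like it covers the same use case, though I have not confirmed how evenly it balances the folds on my data.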