I am training a machine learning model that uses scikit-learn's PowerTransformer to transform the training data. Here is my code:
from joblib import dump
from sklearn.preprocessing import PowerTransformer

yj = PowerTransformer(method='yeo-johnson')
df = yj.fit_transform(df)
dump(yj, 'yeo_johnson_scaler.bin', compress=True)
and it works perfectly fine. When I later deploy the model on new data, I reload the fitted transformer as follows:
from joblib import load

yj = load('yeo_johnson_scaler.bin')
df = yj.transform(df)
However, I get the following warning message:
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator StandardScaler from version 1.2.2 when using version 1.3.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
As far as I understand, a scikit-learn update might lead to inconsistencies. I tried reading the linked page, but I don't really understand why this would cause a problem in my code, or how to resolve it.
Scikit-learn doesn't use a sophisticated serialization method to store trained models; it simply pickles them.
Pickling essentially copies an object's state as a byte stream. Unpickling only reconstructs the object correctly if each class definition is the same as the one used when pickling.
If a class definition has changed (e.g. an attribute was added or removed), the restored state no longer matches what the new class expects, and the object can end up silently broken.
Scikit-learn warns about this whenever you unpickle with a different version than the one used to pickle.
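A minimal sketch of the failure mode, using only the standard-library pickle module (the class and attribute names here are made up for illustration):

```python
import pickle

class Scaler:                      # "version 1" of a class
    def __init__(self):
        self.mean = 2.0

blob = pickle.dumps(Scaler())      # serialize the object's state

# Simulate upgrading the library: the class gains a new attribute
# and a method that relies on it.
class Scaler:                      # "version 2" of the same class
    def __init__(self):
        self.mean = 2.0
        self.scale = 1.5           # attribute that old pickles don't have

    def transform(self, x):
        return (x - self.mean) / self.scale

old = pickle.loads(blob)           # loads without error: state is restored by name,
                                   # __init__ is never re-run
try:
    old.transform(10.0)            # but the restored object lacks 'scale'
except AttributeError as e:
    print("broken:", e)
```

The pickle loads without complaint; the breakage only surfaces later, when a method touches state the old object never had. That is why scikit-learn emits a warning up front rather than letting it fail silently.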
TL;DR: train with the same scikit-learn version you use at deployment.
If you can't do that, the scikit-learn page on model persistence is a useful read.