I have a LabelEncoder with 500 classes.
To store and load it, I used pickle:
import pickle
with open('../data/label_encoder_v500.pkl', 'rb') as file:
    label_encoder = pickle.load(file)
I want to add 24 new classes to this encoder, keeping existing labels unchanged.
additional_classes = ['class501', 'class502', ..., 'class524']
However, LabelEncoder does not seem to support this operation out of the box. How can I do this?
Since the LabelEncoder is quite simple, one option is to subclass it into an expanded LabelEncoder that has this functionality.
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d
import numpy as np
class ExpandedLabelEncoder(LabelEncoder):
    def add_extra_labels(self, y):
        # validate the new labels as a 1-D array
        y = column_or_1d(y, warn=True)
        # refit on the old classes plus the new labels (fit sorts them again)
        y_expand = np.concatenate((self.classes_, y))
        super().fit(y_expand)
encoder = ExpandedLabelEncoder()
encoder.fit(['class001', ..., 'class500'])
# save as you normally do
Then load it again and use the added method to register the new labels:
with open('../data/label_encoder_v500.pkl', 'rb') as file:
    encoder = pickle.load(file)
additional_classes = ['class501', 'class502', ..., 'class524']
encoder.add_extra_labels(additional_classes)
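Note that, as with the standard LabelEncoder, the integer codes do not follow the order in which labels were passed to fit: fit sorts the unique labels, and each label's code is its position in that sorted array. Existing codes therefore stay unchanged only if every added label sorts after all existing ones.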
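To illustrate how sorting decides the codes, here is a minimal sketch that uses np.unique as a stand-in for what fit does with the labels. The helper `codes_for` and the fruit labels are made up for this illustration and are not part of scikit-learn.

```python
import numpy as np

def codes_for(labels):
    """Map each label to its position among the sorted unique labels.

    Hypothetical helper: np.unique stands in for the sorting done by fit.
    """
    classes = np.unique(labels)  # sorted, duplicates removed
    return {str(label): code for code, label in enumerate(classes)}

before = codes_for(["pear", "apple", "plum"])
# "quince" sorts after every existing label, so the old codes are kept
after_late = codes_for(["pear", "apple", "plum", "quince"])
# "banana" sorts between "apple" and "pear", so "pear" and "plum" shift by one
after_early = codes_for(["pear", "apple", "plum", "banana"])

print(before)
print(after_late)
print(after_early)
```

In the last case every label that sorts after the newcomer gets a new code, which is exactly what should be avoided when the old codes are already in use elsewhere.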
I am not sure that what you want to do is actually possible. The natural idea, as stated previously, is to manually add labels to the classes_ attribute. However, it does not always behave as intended:
from sklearn.preprocessing import LabelEncoder
import numpy as np
le = LabelEncoder()
le.fit([1, 2, 2, 6])
le.transform([1, 1, 2, 6])
>>> array([0, 0, 1, 2])
Now let’s manually add a label:
import numpy as np
le.classes_ = np.concatenate([le.classes_, [4]])  # classes_ is now [1, 2, 6, 4]
le.transform([1,2,6,4])
>>> array([0, 1, 2, 2])
whereas you would have expected array([0,1,2,3]). If you sort the labels while adding them,
le.classes_ = np.sort(le.classes_)  # the 4 was appended above; classes_ is now [1, 2, 4, 6]
le.transform([1,2,6,4])
>>> array([0, 1, 3, 2])
Every value now gets a distinct index, but your “6” is no longer encoded the same way as before, which may be a problem down the road.
Note, however, that the problem would not have occurred if your new labels were all above the maximum of your initial labels (for instance, if I had added a 7 instead of a 4 in the example above).
This is because transform expects the labels to be sorted.
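A small sketch of why the order matters, assuming that numeric labels are looked up with a binary search such as np.searchsorted (the array names are made up for this illustration):

```python
import numpy as np

unsorted_classes = np.array([1, 2, 6, 4])     # 4 appended at the end
sorted_classes = np.sort(unsorted_classes)    # [1, 2, 4, 6]
queries = np.array([1, 2, 6, 4])

# Binary search is only meaningful on sorted input; on the unsorted array
# the returned positions need not be distinct or correct.
codes_unsorted = np.searchsorted(unsorted_classes, queries)
codes_sorted = np.searchsorted(sorted_classes, queries)

print(codes_unsorted)
print(codes_sorted)  # [0 1 3 2]: distinct codes, but 6 moved from 2 to 3
```

With a 7 in place of the 4, the appended array would already be sorted, and the codes for 1, 2 and 6 would stay at 0, 1 and 2.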
So I guess a workaround is to rename or remap your new labels so that they are all guaranteed to be greater than the largest existing label.
Yes, this operation is not available out of the box. You can extend the classes_ attribute manually.
Here is an example:
import pickle
import numpy as np
with open('../data/label_encoder_v500.pkl', 'rb') as file:
    label_encoder = pickle.load(file)
additional_classes = ['class501', ...,]
updated_classes = np.concatenate([label_encoder.classes_, additional_classes])
# keep classes_ sorted; existing codes are preserved only if the new labels sort last
label_encoder.classes_ = np.sort(updated_classes)
# create new file if you want
with open('../data/label_encoder_v524.pkl', 'wb') as file:
    pickle.dump(label_encoder, file)
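Since sorting can silently move existing labels, it may be worth checking before overwriting classes_. Below is a rough sketch of such a check; `extend_classes` is a hypothetical helper (not a scikit-learn function), it assumes the existing array is already sorted as fit leaves it, and the short label list only stands in for the real 500 classes.

```python
import numpy as np

def extend_classes(existing, additional):
    """Return the existing classes followed by the new ones.

    Hypothetical helper: raises ValueError if a new label is already
    present or would sort before an existing label, because either case
    would change or duplicate an existing code.
    """
    existing = np.asarray(existing)          # assumed sorted, as after fit
    additional = np.unique(additional)       # sorted, duplicates removed
    overlap = np.intersect1d(existing, additional)
    if overlap.size:
        raise ValueError(f"labels already present: {overlap.tolist()}")
    if existing.size and additional.size and additional[0] < existing[-1]:
        raise ValueError("new labels must sort after every existing label")
    return np.concatenate([existing, additional])

old = np.array(["class001", "class002", "class500"])  # stand-in for 500 classes
new = extend_classes(old, ["class502", "class501"])
print(new)
```

The returned array could then be assigned to label_encoder.classes_ before pickling; the first len(old) entries are identical to the old classes, so their codes do not move.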