My use case is simple. I have 3 labels, say positive, negative and neutral. The data for this ML model arrives streamed in batches; let's assume each batch holds 100 samples (batch_size=100). This is clearly an online / incremental learning problem. A batch may also contain imbalanced / skewed samples: for example, batches B1-B4 may be all positive, B5-B10 all negative and B11-B15 all neutral.
Having understood the use case and after doing some basic research, I decided to use SGDClassifier with partial_fit(), which seemed to address my problem well.
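For reference, this is a stripped-down sketch of the call pattern I have in mind (the data here is made up; my real features come from my preprocessing pipeline):

import numpy as np
from sklearn.linear_model import SGDClassifier

# The full label space is declared once; sklearn accepts the same classes
# array on later calls as long as it does not change.
ALL_CLASSES = np.array(["negative", "neutral", "positive"])
clf = SGDClassifier(random_state=42)

def train_on_batch(clf, X_batch, y_batch):
    # X_batch: (100, n_features), y_batch: (100,) labels drawn from ALL_CLASSES
    clf.partial_fit(X_batch, y_batch, classes=ALL_CLASSES)
    return clf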
The difficulty I face is that after completing 15 batches of training (5 batches per label), my model predicts every inference sample as the most recent label. Digging into the root cause, I found that the last few batches the model was trained on contained only neutral samples, so the model drifts towards them and everything ends up being predicted as neutral.
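Here is a rough, self-contained reproduction of the behaviour, using make_classification as stand-in data for my stream:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Feed five 100-sample batches per class, one class at a time, as in B1-B15.
X, y = make_classification(n_samples=1500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
for label in classes:
    X_label = X[y == label][:500]             # up to 5 batches x 100 samples of one class
    for start in range(0, len(X_label), 100):
        X_batch = X_label[start:start + 100]
        y_batch = np.full(len(X_batch), label)
        clf.partial_fit(X_batch, y_batch, classes=classes)

# After the final single-class batches, the predictions tend to collapse
# to the most recent class in my runs.
print(np.unique(clf.predict(X)))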
On top of this difficulty, I have another constraint: I cannot apply data balancing techniques like SMOTE, undersampling or oversampling, because I never have access to the full dataset. At any given point in time I only have the current batch (100 samples) and a model trained on the previous batches via partial_fit(). Below is the training routine I have tried so far:
# Fragment from inside my training function (imports and the enclosing
# function signature are omitted).
start_timer = time.time()

# One array of possible classes per target column
column_classes_lists = [
    np.array(sgd_model[PIPELINE_MODEL_CONSTANTS.META][PIPELINE_MODEL_CONSTANTS.CLASSES].get(key))
    for key in sgd_model[PIPELINE_MODEL_CONSTANTS.META][PIPELINE_MODEL_CONSTANTS.CLASSES].keys()
]

if data is not None:
    kf = KFold(n_splits=PREDICTOR_CONSTANTS.NUM_FOLDS)
    y_predictions, y_actual = [], []
    # target_df = pd.DataFrame(target)
    for train_index, val_index in kf.split(data):
        X_train_fold = [data[idx] for idx in train_index]
        X_val_fold = [data[idx] for idx in val_index]
        y_train_fold = [[target[col][idx][0] for col in target.keys()] for idx in train_index]
        y_val_fold = [[target[col][idx][0] for col in target.keys()] for idx in val_index]

        # Incrementally train on the training split of this fold
        sgd_model[PIPELINE_MODEL_CONSTANTS.MODEL].partial_fit(
            np.array(X_train_fold), y_train_fold, classes=column_classes_lists
        )
        y_predictions.extend(sgd_model[PIPELINE_MODEL_CONSTANTS.MODEL].predict(X_val_fold))
        y_actual.extend(y_val_fold)

    Utils.calculate_prediction_metrics(sgd_model, target, y_actual, y_predictions, column_classes_lists, meta)
else:
    Utils.calculate_prediction_metrics(sgd_model, target, None, None, column_classes_lists, meta)

logger.warning(
    f"SGDModel trained successfully for id: {id} in {time.time() - start_timer} secs "
    f"and the score metrics are: {sgd_model[PIPELINE_MODEL_CONSTANTS.META][PIPELINE_MODEL_CONSTANTS.SCORES]}"
)
return sgd_model
In addition, new labels may appear at any point during training. For partial_fit() we need to pass all possible labels in the first training call, so I suspect sklearn's partial_fit() may not fit my case very well.
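A toy illustration of that constraint (random data, just to show the API behaviour as I understand it):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
X_batch = np.random.rand(100, 5)
y_batch = np.random.choice(["positive", "negative"], size=100)

# clf.partial_fit(X_batch, y_batch)  # raises ValueError: classes must be
#                                    # passed on the first call to partial_fit
clf.partial_fit(X_batch, y_batch,
                classes=np.array(["negative", "neutral", "positive"]))

# As far as I can tell, there is no supported way to extend clf.classes_
# afterwards if a brand-new label shows up mid-stream.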
Could you please help me find an optimal solution for this case, ideally with an implementation in Python?