My use case is basic. I have 3 labels, say positive, negative, and neutral. The data for this ML model arrives streamed / batched. Let's assume that each batch holds 100 samples (batch_size=100).
I can clearly see that this is an online / incremental learning problem. There is also a possibility that a batch contains imbalanced / skewed data samples.
For example, batches B1-B5 may be all positive, B6-B10 all negative, and B11-B15 all neutral.
Having understood the use case and after performing some basic research, I decided to use SGDClassifier with partial_fit(), which seemed to address my problem well.
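For reference, this is roughly what my training loop looks like (the `batch_stream` generator below is just a stand-in I wrote to reproduce the skew described above; the real features come from my stream):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array(["positive", "negative", "neutral"])

def batch_stream(n_batches=15, batch_size=100, n_features=20):
    # Toy stand-in for the real stream: batches 1-5 are all "positive",
    # 6-10 all "negative", 11-15 all "neutral" (the skew described above).
    for i in range(n_batches):
        label = classes[i // 5]
        X = rng.normal(size=(batch_size, n_features))
        y = np.full(batch_size, label)
        yield X, y

clf = SGDClassifier(random_state=42)
for X_batch, y_batch in batch_stream():
    # classes= is required on the first partial_fit call so the model
    # knows every label up front, even ones absent from this batch;
    # passing it on every call is harmless.
    clf.partial_fit(X_batch, y_batch, classes=classes)
```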
The actual difficulty I face is that after completing 15 batches of training (5 batches per label), my model predicts all the inference data as the most recent label.
On investigating the root cause, I found that the model's weights end up dominated by the most recent batches (the neutral samples) in the last few rounds of training, and hence everything is predicted as neutral (essentially catastrophic forgetting of the earlier classes).
Given this difficulty, I have another constraint as well. I cannot apply data-balancing techniques like SMOTE, undersampling, oversampling, etc., as I never have access to the full dataset. At any given instant, I have access only to the current batch (100 samples) and a model trained on the previous batches (via partial_fit()).
Could you please help me find an optimal solution for this case?