My question concerns using sklearn logistic regression with a train/test split.
I have a data set of 3500 rows with 27 features and two labels (0 and 1). I have 20 known 1 labels, which I use together with 200 “known” 0 labels as my training and test set. I train my logistic regression model on the training portion, then run the test set and check some metrics.
Now I want to use the model on my original data set. Can I run it on the full original data set, or do I need to remove the 220 labelled rows I used for training and testing first? In that case I would have 3280 rows, not my original 3500.
I have been told that including the original training set causes a data leakage problem, because the model has already seen those rows. Is that correct?
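
To make the setup concrete, here is a minimal sketch of the workflow described above. It is only an illustration under stated assumptions: the feature values are synthetic random numbers rather than the real data, the position of the 220 labelled rows (`labeled_idx`), the 75/25 split, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 3500 rows, 27 features.
X_all = rng.normal(size=(3500, 27))

# Assumption: the first 20 rows are the known 1s, the next 200 the "known" 0s.
labeled_idx = np.arange(220)
y_labeled = np.array([1] * 20 + [0] * 200)
X_all[:20] += 1.0  # shift the positives so the toy problem is learnable

# Split only the 220 labelled rows; stratify keeps some 1s in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_all[labeled_idx], y_labeled,
    test_size=0.25, stratify=y_labeled, random_state=0,
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)  # metrics come from held-out rows

# Option A: predict only on the rows that were never labelled (3280 rows).
unlabeled_mask = np.ones(len(X_all), dtype=bool)
unlabeled_mask[labeled_idx] = False
X_unlabeled = X_all[unlabeled_mask]
pred_unlabeled = model.predict(X_unlabeled)

# Option B: predict on every row, including the 220 already used (3500 rows).
pred_all = model.predict(X_all)
```

The two options in the question differ only in which rows are passed to `predict` at the end; the fitted model and the test-set metrics are the same in both cases.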