I implemented a machine learning model and, to get a sense of its performance, looked at the output of classification_report from sklearn.metrics.
For example, here is a classification report:
[Image: the classification report]
I basically have two questions:
- What is the difference between the support values next to the positive and negative classes (56 and 3147) and the support values next to the macro and weighted averages at the bottom (3203 and 3203), and which one should I use?
- As I understand it, support is the number of samples in each class. Are these the samples in the original dataset, or the samples fed into the machine learning model? I ask because I resample the data, since the dataset is imbalanced. In other words, are the correct support values based on the original (imbalanced) dataset or on the (balanced) dataset fed into the model?
For my first question, I believe the “correct” support values are 3203 and 3203. This ties in with my second question: I think support is based on the dataset fed into the model, so it should be balanced (otherwise, how would the model “see” the original dataset?).
By the way, everything is in a pipeline, so there are no data leaks and the model never “sees” test data, in case that is relevant.
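For reference on the support question, here is a minimal sketch with made-up labels (not the actual data or pipeline). It assumes an evaluation set with 56 positives and 3147 negatives, chosen only to mirror the numbers in the report, and uses random predictions as a stand-in for a model. It illustrates that classification_report counts support from the y_true array passed to it, and that the support on the average rows is the sum of the per-class supports.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical true labels mirroring the class counts in the report:
# 56 positives (1) and 3147 negatives (0).
y_true = np.array([1] * 56 + [0] * 3147)

# Hypothetical predictions (random stand-in for a model).
# Support depends only on y_true, so any predictions give the same support.
rng = np.random.default_rng(0)
y_pred = rng.integers(0, 2, size=y_true.shape[0])

# output_dict=True returns the report as a nested dict instead of a string.
report = classification_report(y_true, y_pred, output_dict=True)

# Per-class support: number of samples of each class in y_true.
per_class_support = {label: report[label]["support"] for label in ("0", "1")}

# Support on the average rows: total number of samples in y_true.
macro_support = report["macro avg"]["support"]
weighted_support = report["weighted avg"]["support"]

print(per_class_support)
print(macro_support, weighted_support)
```

In this sketch the per-class values come out as 3147 and 56, and both average rows report 3203 (3147 + 56), regardless of what the predictions are or how the training data was resampled.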
My question is not a duplicate of the one in the above link, as I am asking about the whole classification report, and not just one part.