I’m new to ML and would like to know more about classification. I have a small dataset of n=600 scored samples and thousands of candidate features, all binary (True or False). Basically, I would like to find out which of these thousands of features best predict the known score, so that I can use them on an unlabeled dataset. I’m also thinking of summing the True values of the good features to get a single numerical metric that would easily show which samples are most likely to have high scores (assuming that a True value in each feature correlates with a higher score).
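To make the idea concrete, here is a rough sketch of what I have in mind, not something I have validated. It uses synthetic data as a stand-in for my real dataset; the array shapes, the hypothetical `rank_features` helper, the choice of per-feature Pearson (point-biserial) correlation as the ranking criterion, and `top_k = 10` are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in (assumption): 600 samples, 2000 boolean features.
n_samples, n_features = 600, 2000
X = rng.random((n_samples, n_features)) < 0.5
# Hypothetical score: in this toy data only the first 10 features matter.
score = X[:, :10].sum(axis=1) + rng.normal(0.0, 1.0, n_samples)


def rank_features(X, y):
    """Pearson (point-biserial) correlation of each boolean column with y."""
    Xc = X.astype(float) - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; give them a correlation of 0.
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, (Xc.T @ yc) / safe, 0.0)


corr = rank_features(X, score)

# Keep the top_k most positively correlated features (top_k is arbitrary here).
top_k = 10
top_idx = np.argsort(-corr)[:top_k]

# Combined metric: number of True values among the selected features.
combined = X[:, top_idx].sum(axis=1)

print("selected feature indices:", sorted(top_idx.tolist()))
print("correlation of combined metric with score:",
      round(float(np.corrcoef(combined, score)[0, 1]), 3))
```

One thing I am unsure about: with thousands of features and only 600 samples, selecting features and measuring the correlation on the same data will presumably look better than it really is, so I suspect the selection would need to happen on a held-out split or inside cross-validation.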
Oh, and if that helps, I usually code in Python but am also a neophyte when it comes to using GPUs in my scripts, so if you have any advice regarding that, please let me know!
I understand that with this number of samples it’s hard to build an actual model (especially since the score categories differ significantly in size), so I am mostly looking to better understand the impact of each feature and how to combine the features in a meaningful way.
Thank you!
So far I have used prior knowledge about the features to hand-pick 40 of them, without running any tests. I combined them into a single numerical metric by summation and computed the Pearson correlation between that metric and the known scores. The values I obtain are quite low (around 0.35), and things look even worse when I compute an R² score after regressing against the known scores.
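For reference, this is roughly what my current approach looks like, as a minimal sketch on synthetic data rather than my actual script. The data generation, the variable names, and the use of a simple one-predictor least-squares fit for the regression step are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 600 samples, 40 hand-picked boolean features.
n_samples, n_selected = 600, 40
X = rng.random((n_samples, n_selected)) < 0.3
# Hypothetical score: loosely driven by the first 5 features plus noise.
score = X[:, :5].sum(axis=1) + rng.normal(0.0, 2.0, n_samples)

# Combined metric: number of True values per sample.
combined = X.sum(axis=1)

# Pearson correlation between the combined metric and the score.
r = np.corrcoef(combined, score)[0, 1]

# Simple linear regression (score ~ combined), then R^2 on the same data.
slope, intercept = np.polyfit(combined, score, 1)
pred = slope * combined + intercept
ss_res = np.sum((score - pred) ** 2)
ss_tot = np.sum((score - score.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"Pearson r = {r:.3f}, R^2 = {r2:.3f}")
```

If I understand correctly, for a linear fit with a single predictor and an intercept, the in-sample R² is just the square of Pearson's r, so r ≈ 0.35 would correspond to R² ≈ 0.12. That would explain why the regression score looks so much worse than the correlation even though both describe the same relationship.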