With the Olympics coming up I thought it would be an interesting opportunity to skill up on some machine learning approaches. I’ve read various examples that try to predict results around horse racing, football, etc. but I’m missing some details for my use case.
Let’s use swimming as an example. I have a bunch of historical performance data for athletes over various distances and events. I want to train a classification model to be able to predict the winner of a given race.
I’ve worked out how to use GroupKFold to ensure the data in my train and test sets is cleanly split along the boundaries of my race_id, so that I don’t have half of some races ending up in both. I then train the model on various features, targeting a classification outcome based on the value of is_winner.
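To make the setup concrete, here is a minimal sketch of that kind of split. It assumes scikit-learn's `GroupKFold` with `race_id` passed as the grouping key. The synthetic data, the three feature columns, and the fold count are made up for illustration and are not my real dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_races, n_swimmers = 20, 8  # hypothetical sizes

# One row per (race, swimmer); race_id is the grouping key.
race_id = np.repeat(np.arange(n_races), n_swimmers)
X = rng.normal(size=(n_races * n_swimmers, 3))  # placeholder features

# Exactly one winner per race, chosen at random for this sketch.
is_winner = np.zeros(n_races * n_swimmers, dtype=int)
winner_pos = rng.integers(0, n_swimmers, size=n_races)
is_winner[np.arange(n_races) * n_swimmers + winner_pos] = 1

# GroupKFold keeps every row of a given race on the same side of the split.
gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X, is_winner, groups=race_id))

for train_idx, test_idx in folds:
    shared = set(race_id[train_idx]) & set(race_id[test_idx])
    print(f"train rows={len(train_idx)}, test rows={len(test_idx)}, shared races={len(shared)}")
```

Every fold should report zero shared races, which is the property I care about.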
The model needs to make its best effort at looking at the test set and predicting the winner in each race. I’m not trying to look at the whole dataset and say “classify all of the rows that could be a winner”. This would be unhelpful in the event of something like a final, which is composed of ~8 swimmers who could reasonably be undefeated in all of their domestic races and most of their international ones, and who only once or twice a year race against the half dozen other people on the planet who are faster than them.
That is why I’m trying to work out how to classify and predict within an individual race. The best solution I’ve come up with so far is to manually loop over every single race in the test set, run the prediction on only those ~8 entrants, repeat, and keep a running total of results to calculate precision, accuracy, recall, etc. This is proving to be incredibly slow, though.
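For reference, the per-race decision I am after can be sketched without an explicit loop: score every test row once, then pick the highest-scoring entrant within each race using a pandas groupby. This is only an illustration under assumptions. The helper name `pick_race_winners` and the toy numbers are hypothetical, and in practice `win_proba` would presumably come from something like `model.predict_proba(X_test)[:, 1]` on a fitted scikit-learn classifier.

```python
import numpy as np
import pandas as pd

def pick_race_winners(race_id, win_proba):
    """Return a 0/1 array with exactly one predicted winner per race.

    Hypothetical helper: the entrant with the highest score in each race
    is marked 1 (ties go to the first such entrant), everyone else 0.
    """
    scores = pd.Series(np.asarray(win_proba, dtype=float))
    # idxmax returns, for each race, the row position of the top score.
    top_rows = scores.groupby(np.asarray(race_id)).idxmax().to_numpy()
    pred = np.zeros(len(scores), dtype=int)
    pred[top_rows] = 1
    return pred

# Toy example: two races with three entrants each (made-up numbers).
race_id = np.array([1, 1, 1, 2, 2, 2])
win_proba = np.array([0.2, 0.7, 0.1, 0.4, 0.3, 0.35])
is_winner = np.array([0, 1, 0, 0, 0, 1])

pred = pick_race_winners(race_id, win_proba)

# Race-level accuracy: fraction of races where the pick is the true winner.
hits = (pred == 1) & (is_winner == 1)
race_accuracy = hits.sum() / len(np.unique(race_id))
print(pred, race_accuracy)
```

Since each race gets exactly one pick and has exactly one true winner, precision and recall on the winner class coincide with this race-level accuracy.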
Is there a better approach I could explore?
Glenn Gillen is a new contributor to this site. Take care in asking for clarification, commenting, and answering.