I’m trying to train a deep learning ranking model with a list-wise loss. Each candidate list contains roughly 100 items, and I have tens of millions of candidate lists. Because the loss requires all item-level scores of a list at the same time, I can only fit one candidate list per batch. Given that constraint, is it safe to assume I’ll need distributed data parallelism to train the model in a reasonable amount of time? A minimal sketch of the kind of training step I mean follows below.
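
For concreteness, here is a simplified sketch of the setup I have in mind (the model architecture, feature sizes, and ListNet-style softmax cross-entropy loss are illustrative stand-ins, not my actual code): each batch is a single candidate list of about 100 items, scored in one forward pass so the list-wise loss can see every score in the list together.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for my actual scoring model.
class ScoringModel(nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, items: torch.Tensor) -> torch.Tensor:
        # items: (list_size, num_features) -> scores: (list_size,)
        return self.net(items).squeeze(-1)

def listwise_loss(scores: torch.Tensor, relevance: torch.Tensor) -> torch.Tensor:
    # ListNet-style loss: cross-entropy between the softmax of the predicted
    # scores and the softmax of the relevance labels. It needs every score in
    # the list at once, which is why one "batch" = one candidate list here.
    return -(torch.softmax(relevance, dim=0) * torch.log_softmax(scores, dim=0)).sum()

model = ScoringModel(num_features=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy data standing in for one of the tens of millions of candidate lists:
# ~100 items, each with 32 features and a relevance label.
items = torch.randn(100, 32)
relevance = torch.randn(100)

optimizer.zero_grad()
scores = model(items)                    # all ~100 item scores in one forward pass
loss = listwise_loss(scores, relevance)  # loss over the whole list
loss.backward()
optimizer.step()
```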