I have an application with 200 GB of data that can't fit in RAM. Given the data and the problem, it would be extremely beneficial to have the TFRecord files "sorted" in a certain way, i.e. deliberately NOT random/shuffled. A key benefit of sorted shards is that they let me quickly swap my training/test splits to perform KFold analysis. However, training on this data is extremely sensitive to improperly shuffled training data, and on top of that, there are likely benefits to ensuring that each batch of training data includes some elements from every shard (similar to class balancing). I don't think my application needs to randomly select elements from each shard for the training data to be sufficiently random, but I also don't want the batches to look similar between epochs. I think the balance between these demands is to start reading each shard at a different point for each epoch. Is this possible?
I have a working TFRecordDataset pipeline, but I suspect the batches look similar between epochs. I searched the TensorFlow documentation for a solution but couldn't find a good one.
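Here is roughly what I'm imagining, as a minimal sketch. The shard paths, records-per-shard count, batch size, and the commented-out model.fit call are placeholders for my real pipeline, and it assumes equal-length shards. The idea is to "rotate" each shard with skip/take so reading starts at a per-epoch random offset, then round-robin interleave the shards so every batch draws from each of them:

import random
import tensorflow as tf

SHARD_PATHS = ["shard-0.tfrecord", "shard-1.tfrecord", "shard-2.tfrecord"]  # placeholder paths
RECORDS_PER_SHARD = 10_000  # placeholder; assumes all shards have equal length

def rotated_shard(path, offset):
    """Read a shard starting `offset` records in, wrapping around to the start."""
    ds = tf.data.TFRecordDataset(path)
    # Note: TFRecord files have no index, so skip() still reads and discards
    # the skipped records; the rotation is not a free seek.
    return ds.skip(offset).concatenate(ds.take(offset))

def epoch_dataset(shard_paths, records_per_shard, batch_size, epoch):
    rng = random.Random(epoch)  # fresh, reproducible start offsets each epoch
    offsets = tf.constant(
        [rng.randrange(records_per_shard) for _ in shard_paths], dtype=tf.int64
    )
    paths = tf.constant(shard_paths)
    indices = tf.data.Dataset.range(len(shard_paths))
    # cycle_length = number of shards with block_length = 1 interleaves the
    # shards round-robin, so every batch of size >= len(shard_paths) contains
    # elements from each shard (until a shard is exhausted).
    ds = indices.interleave(
        lambda i: rotated_shard(paths[i], offsets[i]),
        cycle_length=len(shard_paths),
        block_length=1,
    )
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

for epoch in range(5):
    ds = epoch_dataset(SHARD_PATHS, RECORDS_PER_SHARD, batch_size=32, epoch=epoch)
    # model.fit(ds, epochs=1)  # or iterate the batches directly

Is rebuilding the dataset with a new offset each epoch like this a reasonable approach, or is there a built-in way to achieve the same thing? My main worry is the cost of skip() re-reading records at the start of each epoch.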