I am training a classifier. My data comes from multiple datasets, each dataset contains multiple subjects, and each subject has performed multiple trials. Currently, my data structure on disk looks like this:
-dataset1_subject1.pt
-dataset1_subject2.pt
-dataset1_subjectX.pt
.
.
.
-datasetN_subjectX.pt
Each subject performed a varying number of trials, so
torch.load("dataset1_subject1.pt")
returns a tensor of shape (n_trials, n_channels, n_time).
Long story short, my dataset is too big to fit in memory, and I'm looking for a solution.
Each batch should contain samples/trials from multiple subjects.
I have read about multiple approaches. Some say it's better to load some data, shuffle it, and write it out into big blocks; then load one block into memory, sample from it until nothing is left, and load the next block. This sounds good to me, since loading 10 blocks is faster than loading individual trials. However, it makes cross-validation more challenging, since I would need to build new blocks for every cross-validation split.
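To make this concrete, here is roughly what I imagine for the block approach (just a sketch on my side; the block file names and the dict layout with "x"/"y" keys are made up):

```python
import random
import torch
from torch.utils.data import IterableDataset, DataLoader

class BlockDataset(IterableDataset):
    """Stream single trials while keeping only one pre-shuffled block in memory."""

    def __init__(self, block_paths, shuffle=True):
        self.block_paths = list(block_paths)  # e.g. ["block_00.pt", "block_01.pt", ...]
        self.shuffle = shuffle

    def __iter__(self):
        paths = self.block_paths[:]
        if self.shuffle:
            random.shuffle(paths)             # new block order every epoch
        for path in paths:
            # each block file is assumed to hold {"x": (n, n_channels, n_time), "y": (n,)}
            block = torch.load(path)
            n = len(block["y"])
            order = torch.randperm(n) if self.shuffle else torch.arange(n)
            for i in order:
                yield block["x"][i], block["y"][i]

# note: with num_workers > 0 every worker would iterate all blocks,
# so the block list would need to be split across workers
loader = DataLoader(BlockDataset(["block_00.pt", "block_01.pt"]), batch_size=64)
```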
Others say it's better to split the data into individual samples. So instead of
ds1_subject1.pt
I would save a lot of files like
ds1_subject1_sample001.pt
.
.
.
(77k files in my case).
Then I could use something like HDF5 or memory mapping for lazy loading (to be honest, I don't know the difference, but they seem similar).
Also, do I need to split the data into individual trials to use HDF5, or could I sample individual trials from a subject without loading the entire subject?
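From what I understand, HDF5 would let me slice a single trial out of a subject's dataset without reading the rest, so I wouldn't have to split anything. Something like this is what I picture (a sketch only; the file name, group layout, and chunking are just my assumptions):

```python
import h5py
import torch

# one-time conversion from the existing .pt files (paths are just examples)
with h5py.File("all_data.h5", "w") as f:
    trials = torch.load("dataset1_subject1.pt")        # (n_trials, n_channels, n_time)
    f.create_dataset(
        "dataset1/subject1",
        data=trials.numpy(),
        chunks=(1, trials.shape[1], trials.shape[2]),   # one chunk per trial
    )

# later, read a single trial without loading the whole subject
with h5py.File("all_data.h5", "r") as f:
    trial = f["dataset1/subject1"][5]                   # only trial 5 is read from disk
    x = torch.from_numpy(trial)
```

Is that understanding correct?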
There is a lot of confusion here, and ChatGPT wasn't able to help.
I have also read about this: https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/
which sounds very promising, since I might use multiple GPUs at some point.
Currently I'm leaning towards splitting the data into 77k individual samples and using the approach from that post where they are packed into .tar shards (WebDataset).
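Something like this is what I have in mind with the WebDataset library from that blog post (again just a rough sketch; the shard name pattern and the trial.pth / label.cls keys inside the tar are placeholders I made up):

```python
import io
import torch
import webdataset as wds
from torch.utils.data import DataLoader

def decode_sample(sample):
    # sample is a dict of raw bytes keyed by the file extensions inside the tar
    x = torch.load(io.BytesIO(sample["trial.pth"]))  # the trial tensor
    y = int(sample["label.cls"])                     # the class label
    return x, y

dataset = (
    wds.WebDataset("shards/shard-{000000..000076}.tar")  # placeholder shard pattern
    .shuffle(1000)                                        # shuffle buffer across samples
    .map(decode_sample)
)
loader = DataLoader(dataset, batch_size=64, num_workers=4)
```

My understanding is that the point of the shards is that samples inside a tar are read sequentially, so I wouldn't actually keep 77k loose files on disk.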
What would you recommend? Has anyone here faced similar problems? Any help is very much appreciated!
Best,
Samuel :)