I have a large JSON file (>10 GB) that I want to use for training across 2 GPU nodes. The dataset is used to build a data loader:
...
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=world_size, rank=rank)
data_loader = torch.utils.data.DataLoader(dataset,
                                          batch_size=micro_batch_size,
                                          sampler=sampler,
                                          shuffle=False,
                                          num_workers=num_workers,
                                          drop_last=drop_last,
                                          pin_memory=True,
                                          collate_fn=task_collate_fn)
...
How can I build a dataset from a single large JSON file that works with the distributed sampler and data loader? Reading the whole file from disk into memory is not an option here.
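The only approach I can think of so far is the sketch below. It assumes the file can first be converted to JSON Lines (one record per line), builds a byte-offset index in a single pass, and then seeks to the requested record in __getitem__, so only one sample is held in memory at a time (the class name LazyJsonlDataset is just a placeholder). Would something like this work correctly with DistributedSampler, or is there a better way?

import json
from torch.utils.data import Dataset

class LazyJsonlDataset(Dataset):
    # Map-style dataset: one JSON record per line, read lazily via byte offsets.
    def __init__(self, path):
        self.path = path
        self.offsets = []
        # One pass over the file to record where each line starts.
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek directly to the record instead of keeping the file in memory.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline()
        return json.loads(line)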