I’m working with Hugging Face datasets and need to split a dataset into training and validation sets. My main requirement is that the dataset be processed in streaming mode, because I don’t want to load it entirely into memory. The code below works, but only on a fully loaded, non-streaming dataset:
<code>from datasets import load_dataset

# Load a dataset from the Hugging Face Hub
dataset = load_dataset('squad', split='train')

# Split the dataset into training and validation sets;
# test_size is the fraction held out for validation
train_val_split = dataset.train_test_split(test_size=0.1)

# Extract the training and validation datasets
train_dataset = train_val_split['train']
val_dataset = train_val_split['test']

# Print the size of each split
print(f"Training set size: {len(train_dataset)}")
print(f"Validation set size: {len(val_dataset)}")

# Save the splits if needed
# train_dataset.save_to_disk('path/to/train_dataset')
# val_dataset.save_to_disk('path/to/val_dataset')
</code>
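For context, the same call does not carry over to streaming mode as far as I can tell: with streaming=True, load_dataset returns an IterableDataset, which seems to have neither train_test_split() nor a length, so the snippet above has nothing to split on. A minimal repro of what I’m seeing (the commented-out lines are the failures):
<code>from datasets import load_dataset

# streaming=True returns an IterableDataset instead of a map-style Dataset
streamed = load_dataset('squad', split='train', streaming=True)
print(type(streamed))  # <class 'datasets.iterable_dataset.IterableDataset'>

# Both of these fail on an IterableDataset, as far as I can tell:
# streamed.train_test_split(test_size=0.1)  # AttributeError
# len(streamed)                             # TypeError: object has no len()
</code>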
Is there a recommended way to split a Hugging Face dataset into train and validation sets in streaming mode, without loading it into memory? Any suggestions or improvements to my code would be greatly appreciated.
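From the threads linked under refs, it sounds like IterableDataset exposes take(), skip(), and a buffered shuffle(), so here is my best guess at a streaming split. This is only a sketch, not something I’ve verified end to end, and val_size is a placeholder count I picked myself, since a stream has no length to split by fraction:
<code>from datasets import load_dataset

# Streaming mode yields examples lazily instead of loading everything
streamed = load_dataset('squad', split='train', streaming=True)

# Approximate shuffling with a bounded buffer; buffer_size trades
# memory for shuffle quality
streamed = streamed.shuffle(seed=42, buffer_size=10_000)

# Carve the validation set off the head of the shuffled stream and
# train on the rest; take() and skip() both return new IterableDatasets
val_size = 8_000  # placeholder: streams have no len(), so no fractional split
val_dataset = streamed.take(val_size)
train_dataset = streamed.skip(val_size)

# Sanity check: iterate a few validation examples without materializing them
for i, example in enumerate(val_dataset):
    if i >= 3:
        break
    print(example['id'])
</code>
In particular: do take() and skip() over the same shuffled stream actually produce disjoint, reproducible splits, or would a deterministic filter() on a stable key (for example, hashing each example’s id) be the safer pattern?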
refs:
- https://discuss.huggingface.co/t/how-to-split-a-dataset-into-train-test-and-validation/1238
- https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090/21
- https://discuss.huggingface.co/t/possible-to-stream-and-create-new-splits/67214
- https://huggingface.co/docs/datasets/v1.11.0/splits.html
- https://discuss.huggingface.co/t/how-to-split-a-hugging-face-dataset-in-streaming-mode-without-loading-it-into-memory/87205