How can I run an entire HuggingFace IterableDataset through one function before it reaches another function?
I am building my own tokenizer using the byte pair encoding (BPE) algorithm and applying it to a HuggingFace dataset. Because my computer has limited memory, I first convert the dataset to an IterableDataset, which is processed lazily. Here is the part of my tokenizer function that is relevant to the problem: