I have a very large .txt file (several gigabytes) that I need to split into training and test sets for a machine learning project. Reading the entire file into memory and then splitting it is not feasible due to memory constraints, so I am looking for a way to split the file efficiently without overloading memory.
I attempted to use scikit-learn's train_test_split, but it operates on in-memory data, so I still have to read the whole file first, which causes performance issues and is not suitable for my large dataset.
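For reference, my attempt looked roughly like this (a minimal sketch; `train_test_split` is scikit-learn's actual API, but the file name and split parameters are placeholders):

```python
from sklearn.model_selection import train_test_split

# Reads every line of the file into a Python list, which is exactly
# what blows up memory on a multi-gigabyte file.
with open("data.txt") as f:   # "data.txt" is a placeholder path
    lines = f.readlines()

# train_test_split shuffles and splits the in-memory list, so the
# whole dataset must already be loaded before this call.
train_lines, test_lines = train_test_split(
    lines, test_size=0.2, random_state=42
)
```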