I have several thousand text files with methylation data in them. These are tab-separated files, and I am only interested in two columns: the name of the methylation probe, and the value in the column called “Beta”.
I want to create a table with a single column containing the methylation probe names, and a column of beta values for each sample (one sample per text file). Just to make things curly, each file contains ~840K to 860K probes. Most of them overlap, but there are always a few that don’t. I’d like the end table to contain the union of the rows.
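To make the shape I’m after concrete, here’s a tiny made-up two-sample example (the probe names and values are invented, and the join is only there to illustrate the target layout, not how I plan to build it at this scale):

import pyarrow as pa

sample1 = pa.table({"Name": ["cg001", "cg002"], "Beta": [0.10, 0.85]})
sample2 = pa.table({"Name": ["cg002", "cg003"], "Beta": [0.80, 0.05]})

# Full outer join on the probe name gives the union of rows,
# one Beta column per sample, and nulls where a probe is missing.
combined = sample1.join(sample2, keys="Name", join_type="full outer",
                        left_suffix="_sample1", right_suffix="_sample2")
# combined columns: Name, Beta_sample1, Beta_sample2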
After moving from pandas to Arrow, I had initially been reading the files in one by one using pyarrow.csv and concatenating the resulting tables. That appeared to work, but it always ran into an OOM situation and the process got killed once I got past 1K samples.
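For reference, this is roughly what that first version looked like (simplified from memory; filtered_samples is my list of input file paths):

import pyarrow as pa
import pyarrow.csv as csv

parse_options = csv.ParseOptions(delimiter="\t")
convert_options = csv.ConvertOptions(include_columns=["Name", "Beta"])

tables = []
for path in filtered_samples:
    tables.append(csv.read_csv(path, parse_options=parse_options,
                               convert_options=convert_options))
# Holding every table in memory and concatenating is where it falls over.
big_table = pa.concat_tables(tables)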
I’ve been working through the documentation here, and as I understand it, I should have been using pyarrow.dataset, which I’ve switched over to. I’ve got as far as creating the dataset and reading data in, but I can’t figure out exactly what the next step is.
This gets me most of the way:
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as csv
import pyarrow.dataset as ds
from icecream import ic

# The input files are tab-separated, so set the delimiter explicitly.
parse_options = csv.ParseOptions(delimiter="\t")
file_format = ds.CsvFileFormat(parse_options=parse_options)

# filtered_samples is my list of input file paths.
dataset = ds.dataset(filtered_samples, format=file_format)
ic(dataset)

for batch in dataset.to_batches(columns=["Name", "Beta"], batch_size=500):
    ic(batch)
From the output of that, though, it looks like I’m reading in 500 lines at a time, one file at a time? If that is the case, I don’t think that’s what I’m after, and I don’t know where to go from here.
How do I take the batches from a dataset, combine them into a single large table, and write that to file without running out of memory?
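In case it helps show what I mean, this is the rough direction I was imagining: streaming the batches straight into a Parquet file rather than building one giant in-memory table. The schema and filename are placeholders (I’m assuming Beta comes in as float64), and I don’t know whether this is the right approach, or how it would give me the union of probes across samples:

import pyarrow as pa
import pyarrow.parquet as pq

# dataset is the ds.dataset object created above.
schema = pa.schema([("Name", pa.string()), ("Beta", pa.float64())])
with pq.ParquetWriter("combined.parquet", schema) as writer:
    for batch in dataset.to_batches(columns=["Name", "Beta"], batch_size=500):
        writer.write_batch(batch)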