I have a dataframe with slightly more than half a million rows. Its columns hold very small values, so the dataframe itself fits easily in memory.
On the other hand, I have around 700 GB of data, split into binary files of 50-row chunks, where each chunk row corresponds to a row of my dataframe. So, given a row index from my small dataframe, I can retrieve the corresponding value from the 700 GB “dataset”.
What I want to do is create a single CSV file from all of this, plus some additional data that I will produce for each row.
For context, here’s the access function and the generator function that I’ve coded for retrieving a column value from my 700GB data set given a row index:
import os
import re
import pickle
import gc

filenames = os.listdir("/content/drive/MyDrive/ProteiNNGO_Data/")  # The folder that has the chunks

def get_embedding_chunk(index, old_chunk_start):
    # The chunks are stored with this name pattern, which includes the start and end row indices
    pattern = re.compile(r'^(?:embeddingsssss_|featuress_)(\d+)_(\d+)$')
    for filename in filenames:
        match = pattern.match(filename)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            # Only consider the chunk that actually contains this row index
            if start <= index < end:
                # If it's the same chunk as before, don't load it again
                if start == old_chunk_start:
                    return None, start
                # Otherwise, load the chunk from disk
                with open(f"/content/drive/MyDrive/ProteiNNGO_Data/{filename}", "rb") as f:
                    return pickle.load(f), start
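To illustrate how it's meant to be called (assuming, just for this example, that the first chunk covers rows 0–49):

chunk, start = get_embedding_chunk(0, -1)     # loads the chunk containing row 0
reuse, start = get_embedding_chunk(1, start)  # same chunk, so this returns (None, 0)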
And here’s a generator I wrote to iterate over the rows efficiently, keeping only one chunk in memory at a time:
def embedding_generator():
    chunk = None
    start = -1
    # Iterate over every row index (`sequences` is defined elsewhere)
    for i in range(len(sequences)):
        new_chunk, new_start = get_embedding_chunk(i, start)
        start = new_start
        # Swap chunks only when a new one was actually loaded
        if new_chunk is not None:
            del chunk
            chunk = new_chunk
            gc.collect()  # reclaim the previous chunk's memory right away
        offset = i - start
        yield i, chunk[offset]
How can I use this generator to add each row to the output CSV without ever holding the entire dataframe in memory?
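For what it's worth, the rough direction I've been considering is to skip the intermediate dataframe entirely and stream each row straight to the file with Python's csv module, something like the sketch below. Here df stands for my small dataframe, and compute_extra is a made-up placeholder for the additional per-row data I mentioned:

import csv

def stream_rows_to_csv(df, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Write the header once: the dataframe's columns plus an extra column
        writer.writerow(list(df.columns) + ["extra"])
        for i, embedding in embedding_generator():
            row = df.iloc[i]
            extra = compute_extra(row, embedding)  # hypothetical per-row computation
            # Each row goes straight to disk, so nothing accumulates in memory
            writer.writerow(list(row) + [extra])

Is something along these lines reasonable, or is there a more idiomatic pandas way, for example writing small batches with to_csv in append mode?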