I have a dataframe with slightly more than half a million rows. Its columns hold very small values, so the dataframe itself fits easily in memory.
On the other hand, I have around 700 GB of data, split into binary files of 50-row chunks, where each chunk row corresponds to a row of my dataframe. So, given a row index from my small dataframe, I can retrieve the corresponding value from the 700 GB “dataset”.
What I want to do is create a single CSV file from all of this, plus some additional data that I will produce for each row.
For context, here’s the access function and the generator function that I’ve coded for retrieving a column value from my 700GB data set given a row index:
import os
import re
import pickle
import gc

filenames = os.listdir("/content/drive/MyDrive/ProteiNNGO_Data/")  # The folder that has the chunks

def get_embedding_chunk(index, old_chunk_start):
    # The chunks are stored with this name pattern, which includes the start and end row indices
    pattern = re.compile(r'^(?:embeddingsssss_|featuress_)(\d+)_(\d+)$')
    for filename in filenames:
        match = pattern.match(filename)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            # Only consider the chunk that actually contains this row index
            if start <= index < end:
                # If it's the same chunk as before, don't load it again
                if start == old_chunk_start:
                    return None, start
                # Otherwise, load the chunk from disk
                with open(f"/content/drive/MyDrive/ProteiNNGO_Data/{filename}", "rb") as f:
                    return pickle.load(f), start
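To illustrate how it's meant to be called (assuming, just for this example, that the first chunk covers rows 0–49):

chunk, start = get_embedding_chunk(0, -1)     # loads the chunk containing row 0
reuse, start = get_embedding_chunk(1, start)  # same chunk, so this returns (None, 0)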
And here’s a generator I wrote to iterate over the rows efficiently, keeping only one chunk in memory at a time:
def embedding_generator():
    chunk = None
    start = -1
    # Iterate over every row index (`sequences` is defined elsewhere)
    for i in range(len(sequences)):
        new_chunk, new_start = get_embedding_chunk(i, start)
        start = new_start
        # Swap chunks only when a new one was actually loaded
        if new_chunk is not None:
            del chunk
            chunk = new_chunk
            gc.collect()  # reclaim the previous chunk's memory right away
        offset = i - start
        yield i, chunk[offset]
How can I use this generator to add each row to the output CSV without ever holding the entire dataframe in memory?
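For what it's worth, the rough direction I've been considering is to skip the intermediate dataframe entirely and stream each row straight to the file with Python's csv module, something like the sketch below. Here df stands for my small dataframe, and compute_extra is a made-up placeholder for the additional per-row data I mentioned:

import csv

def stream_rows_to_csv(df, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Write the header once: the dataframe's columns plus an extra column
        writer.writerow(list(df.columns) + ["extra"])
        for i, embedding in embedding_generator():
            row = df.iloc[i]
            extra = compute_extra(row, embedding)  # hypothetical per-row computation
            # Each row goes straight to disk, so nothing accumulates in memory
            writer.writerow(list(row) + [extra])

Is something along these lines reasonable, or is there a more idiomatic pandas way, for example writing small batches with to_csv in append mode?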