I have a very large dataframe (60+ million rows of DNA sequences) for which I would like to use a transformer model to compute embeddings. Basically, this involves tokenizing the sequences first, and then I can run them through the model to get the embeddings.
Because of RAM limits, I have found that tokenizing and then embedding everything in one Python script won't work. Here's the workaround that worked for a dataframe with ~30 million rows (but it isn't working for the larger dataframe):
- tokenizing, and saving the output as 200 chunks/shards
- feeding those 200 shards to the model separately to get the embeddings
- concatenating those embeddings into one larger file of embeddings
The final embedding file should have these columns:
['Chromosome', 'label', 'embeddings']
Overall, I’m a little lost in terms of how I can get this to work for my larger dataset.
I've looked into streaming the dataset, but I don't think that will actually help, because I need all of the embeddings, not just a few (please correct me if I'm mistaken).
Ideally I would like to avoid sharding the data at all, but at this point I would just like the code to run without hitting the RAM limit.
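To be concrete, the kind of streaming/incremental processing I had in mind is roughly the sketch below: embed one slice of the dataframe at a time and write each slice's embeddings straight to disk, so nothing is held in RAM all at once. The chunk size and output path are just placeholders, and I haven't actually gotten a version of this running end to end:

import numpy as np
import torch

# tokenizer, model, device: the same objects set up in steps 1 and 2 below
chunk_size = 2048  # placeholder; tune so one forward pass fits on the GPU
for start in range(0, len(element_final), chunk_size):
    chunk = element_final.iloc[start:start + chunk_size]
    tokens = tokenizer(chunk["sequence"].tolist(), return_tensors="pt",
                       padding=True, truncation=True, max_length=80)
    with torch.no_grad():
        out = model(input_ids=tokens["input_ids"].to(device),
                    attention_mask=tokens["attention_mask"].to(device),
                    output_hidden_states=True)
    emb = out.hidden_states[-1].mean(dim=1).float().cpu().numpy()
    # write this slice's embeddings out immediately instead of accumulating them
    # (the Chromosome/label columns for the slice would need to be written alongside)
    np.save(f"embeddings_chunk_{start}.npy", emb)

My worry is that I'd still end up with tens of millions of rows spread across many small files to stitch back together, which feels like sharding by another name.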
step 1
from datasets import Dataset
from transformers import AutoTokenizer

dataset = Dataset.from_pandas(element_final[['Chromosome', 'sequence', 'label']])
dataset = dataset.shuffle(seed=42)
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")

def tokenize_function(examples):
    outputs = tokenizer.batch_encode_plus(examples["sequence"], return_tensors="pt",
                                          truncation=False, padding=False, max_length=80)
    return outputs

# Creating tokenized dataset (batched map over the sequences)
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True, batch_size=2000)

# Save the tokenized dataset to disk as 200 shards
tokenized_dataset.save_to_disk(f"tokenized_elements/tokenized_{ELEMENT}", num_shards=200)
step 2 (this code runs over each of the 200 shards)
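The code below assumes that model, device, and filename are already defined at the top of the script, roughly along these lines (the model class and the shard names here are placeholders rather than my exact code):

import gc
import torch
from datasets import Dataset
from transformers import AutoModelForMaskedLM  # placeholder: whichever Auto class loads this checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model.to(device)
model.eval()

# filename loops over the 200 shard files written by save_to_disk in step 1,
# e.g. "data-00000-of-00200", "data-00001-of-00200", ...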
input_file = f"tokenized_elements/tokenized_{ELEMENT_LABEL}/{filename}.arrow"

# Load input data (one shard of the tokenized dataset)
d1 = Dataset.from_file(input_file)

def embed_function(examples):
    torch.cuda.empty_cache()
    gc.collect()
    inputs = torch.tensor(examples['input_ids'])  # Convert to tensor
    inputs = inputs.to(device)
    with torch.no_grad():
        outputs = model(input_ids=inputs, output_hidden_states=True)
    # Extract the embeddings
    hidden_states = outputs.hidden_states  # Hidden states from all layers
    embeddings = hidden_states[-1]  # Embeddings from the last layer
    averaged_embeddings = torch.mean(embeddings, dim=1)  # Mean over the sequence-length dimension
    averaged_embeddings = averaged_embeddings.to(torch.float32)  # Ensure float32 data type
    return {'embeddings': averaged_embeddings}

# Map embeddings function to input data
embeddings = d1.map(embed_function, batched=True, batch_size=1550)
embeddings = embeddings.remove_columns(["input_ids", "attention_mask"])

# Save embeddings to disk
output_dir = f"embedded_elements/embeddings_{ELEMENT_LABEL}/{filename}"  # Assuming ELEMENT_LABEL is defined elsewhere
embeddings.save_to_disk(output_dir)
step 3: concatenate all 200 shards of embeddings into 1.
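The concatenation itself is roughly the following sketch (assuming each shard's embeddings were saved with save_to_disk as in step 2; the final output path is a placeholder):

from pathlib import Path
from datasets import load_from_disk, concatenate_datasets

# Load every per-shard embeddings dataset written in step 2 and concatenate them
shard_dirs = sorted(Path(f"embedded_elements/embeddings_{ELEMENT_LABEL}").iterdir())
all_embeddings = concatenate_datasets([load_from_disk(str(d)) for d in shard_dirs])

# Keep only the columns needed in the final file
all_embeddings = all_embeddings.select_columns(['Chromosome', 'label', 'embeddings'])
all_embeddings.save_to_disk(f"embedded_elements/embeddings_{ELEMENT_LABEL}_all")  # placeholder output path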