I am trying to convert my text into embeddings using a BERT model. When I apply this to my dataset, it works fine for some of the inputs, then stops and gives that error.
I have set TORCH_USE_CUDA_DSA and CUDA_LAUNCH_BLOCKING to 1, and the inputs do not exceed the token limit. I have also tried freeing the GPU memory, and it still didn't work. This is my embedding code:
import gc

import numpy as np
import torch

# tokenizer, model and device are defined earlier (not shown here)

def embeddings(text):
    # Long texts (by character count) are split in half at a sentence
    # boundary, embedded recursively, and the two embeddings are averaged.
    if len(text) > 4000:
        flag = text.split(".")
        t1 = ".".join(flag[:len(flag) // 2])
        t2 = ".".join(flag[len(flag) // 2:])
        emb_avg = np.mean([embeddings(t1), embeddings(t2)], axis=0)
        return emb_avg
    else:
        with torch.no_grad():
            encoded_input = tokenizer(text, return_tensors='pt').to(device)
            output = model(**encoded_input)
            emb = output.encoder_last_hidden_state
            emb_np = emb.cpu().numpy()
            # Drop the GPU tensors and release cached memory
            del emb
            del output
            del encoded_input
            gc.collect()
            torch.cuda.empty_cache()
        # Mean-pool over the token dimension, then flatten to 1-D
        emb_avg = np.mean(emb_np, axis=1)
        emb_avg = emb_avg.flatten()
        return emb_avg
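Note that `len(text) > 4000` counts characters, not tokens. To verify the token limit directly, a check along these lines could be run over the inputs. This is only a minimal sketch: `find_overlong` and `whitespace_count` are hypothetical names made up for illustration, and the whitespace counter merely stands in for the real tokenizer.

```python
from typing import Callable, Iterable, List, Tuple


def find_overlong(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[Tuple[int, int]]:
    """Return (position, token_count) for every text longer than max_tokens."""
    overlong = []
    for position, text in enumerate(texts):
        n_tokens = count_tokens(text)
        if n_tokens > max_tokens:
            overlong.append((position, n_tokens))
    return overlong


def whitespace_count(text: str) -> int:
    # Stand-in counter for this sketch only. With a Hugging Face tokenizer,
    # the assumption is that something like
    #     lambda t: len(tokenizer(t)["input_ids"])
    # would be passed instead, with max_tokens taken from the model's limit.
    return len(text.split())


sample_texts = ["a b c", "a b c d e f"]
overlong = find_overlong(sample_texts, whitespace_count, max_tokens=4)
print(overlong)
```

Any position reported here is an input that would need truncating or splitting before it reaches the model.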
And I am applying it to my dataset:
from tqdm.auto import tqdm
tqdm.pandas()
df['emb'] = df['abstract'].progress_apply(embeddings)
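Since the run fails partway through, it may help to find the first row that triggers the error instead of applying the function to the whole column at once. The following is a minimal sketch under that assumption; `first_failure` and `fake_embed` are hypothetical names for illustration, and `fake_embed` only simulates a failing embedding function.

```python
def first_failure(items, fn):
    """Call fn on each item in order; return (position, item, exception)
    for the first call that raises, or None if every call succeeds."""
    for position, item in enumerate(items):
        try:
            fn(item)
        except Exception as exc:
            return position, item, exc
    return None


def fake_embed(text):
    # Simulated embedding function for this sketch: fails on one input.
    if "bad" in text:
        raise RuntimeError("simulated failure")
    return len(text)


result = first_failure(["ok", "fine", "bad row", "ok"], fake_embed)
if result is not None:
    position, item, exc = result
    print(position, repr(item), type(exc).__name__)
```

With the real data, the assumption is that `df['abstract'].tolist()` and `embeddings` would be passed in. Stopping at the first failure matters because, after a CUDA device-side assert, later CUDA calls in the same process generally fail as well.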