I have an issue with generating embeddings for my dataset. It consists of about 16 000 000 Reddit comments (their bodies plus some negligible metadata). I have them stored in a CSV file, from which I build a pandas DataFrame using pd.read_csv().
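For reference, the loading step is nothing fancy; it is essentially just this (the file name is only illustrative):
import pandas as pd

# Read the entire 16M-row CSV into memory in one go
df = pd.read_csv("comments.csv")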
The hard part starts when I try to generate the embeddings. The previous iteration of my program used a dataset three orders of magnitude smaller, so generating an embedding for each comment was trivial: I simply embedded each comment individually and added the results as a new column, like so:
df["embedding"] = self.embedder.embed_str(temp_df["author"])
This approach, however, has proven insufficient for my new dataset. I spent more than 6 hours waiting for it to finish, only to come back to Killed at the bottom of my terminal, apparently because the process ran out of memory. I have also tried a parallel batched approach, but that only got the process killed faster.
Is there a more efficient way of doing this, or should I just give up and leave the embedding to the training process instead of making it part of my dataset? I would appreciate any general guidance on this, since this is my first brush with data science.
To provide extra context, the aforementioned self.embedder.embed_str()
method is as follows:
def embed_str(self, data: str) -> torch.Tensor:
    """Generates an embedding for the given str data.

    Args:
        data: The data to be embedded.

    Returns:
        A PyTorch tensor containing the embedding.
    """
    return self.model.encode(data)
with self.model being a Jina Embeddings v2 SentenceTransformer, initialised as follows:
self.model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-en",
    trust_remote_code=True,
)
Hardware-wise, I am running the code on an Intel Core i5-10400F with 16 GB of RAM.
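For completeness, here is the kind of chunked, streamed approach I have been wondering about as an alternative. If I understand correctly, the model outputs 768-dimensional vectors, so keeping all 16 000 000 embeddings in the DataFrame would take roughly 16 000 000 × 768 × 4 bytes ≈ 49 GB, which my 16 GB of RAM clearly cannot hold; hence the idea of writing each chunk to disk instead. This is only a sketch: the file names, chunk size, batch size, and the "body" column name are all assumptions on my part.
import os
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-en",
    trust_remote_code=True,
)

CHUNK_SIZE = 10_000  # small enough for one chunk of text + embeddings to fit in RAM
os.makedirs("embeddings", exist_ok=True)

# Stream the CSV in chunks instead of loading all 16M rows at once,
# encode each chunk as a single batched call, and write the result to disk.
for i, chunk in enumerate(pd.read_csv("comments.csv", chunksize=CHUNK_SIZE)):
    vectors = model.encode(
        chunk["body"].astype(str).tolist(),  # assuming "body" holds the comment text
        batch_size=64,
        convert_to_numpy=True,
    )
    np.save(f"embeddings/chunk_{i:05d}.npy", vectors.astype(np.float32))
The idea is that memory usage would then be bounded by the chunk size rather than by the full dataset, at the cost of having to stitch the .npy files back together (or switch to a format like Parquet) later.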