Tags: python, huggingface-transformers, huggingface, huggingface-tokenizers, huggingface-datasets

HuggingFace: Efficient Large-Scale Embedding Extraction for DNA Sequences Using Transformers

I have a very large dataframe (60+ million rows) of DNA sequences, and I would like to use a transformer model to extract an embedding for each row. Basically, this involves tokenizing the sequences first, then passing the tokens through the model to get the embeddings.
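
To make the tokenize-then-embed flow concrete, here is a minimal sketch of running it one batch at a time, so that only a single batch of tokens is held in memory at once. It is an illustration under assumptions, not the code from my actual pipeline: `iter_batches`, `embed_in_batches`, `toy_tokenize`, and `toy_embed` are hypothetical names, and the two toy functions are stand-ins for a real Hugging Face tokenizer call and model forward pass.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `batch_size` items from `items`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def embed_in_batches(
    sequences: Iterable[str],
    tokenize: Callable[[List[str]], list],
    embed: Callable[[list], List[List[float]]],
    batch_size: int = 1024,
) -> Iterator[List[float]]:
    """Tokenize and embed one batch at a time, yielding one vector per sequence."""
    for batch in iter_batches(sequences, batch_size):
        tokens = tokenize(batch)   # only this batch's tokens are alive here
        yield from embed(tokens)   # results are streamed to the caller


# Hypothetical stand-ins so the sketch runs without downloading a model.
def toy_tokenize(batch: List[str]) -> List[List[int]]:
    vocab = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [[vocab[base] for base in seq] for seq in batch]


def toy_embed(token_batch: List[List[int]]) -> List[List[float]]:
    # A 1-dimensional "embedding": the mean token id of each sequence.
    return [[sum(ids) / len(ids)] for ids in token_batch]


demo = list(embed_in_batches(["ACGT", "AAAA", "TTTT"], toy_tokenize, toy_embed, batch_size=2))
print(demo)  # [[1.5], [0.0], [3.0]]
```

Because `embed_in_batches` is a generator, the caller can write each vector (or each batch of vectors) to disk as it arrives instead of collecting all of them in a list.
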
Because of RAM limits, I have found that tokenizing and then embedding everything in a single .py file won't work. Here's the workaround I found, which worked for a dataframe with ~30 million rows (but it isn't working for the larger dataframe):