Issue
Before preprocessing my data with spaCy, I typically have my data stored in a Pandas Series. Since I'd like to preserve the index of each document before serializing my Docs, I decided to use the extension attribute. However, I noticed a dramatic increase in memory usage, to the point where my system runs out of memory. I'm not sure what I might be doing wrong.
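To make the shape of the input concrete, my Series looks roughly like this (illustrative data, not my real corpus):

import pandas as pd

# The index values are what I want to carry through as doc._.idx
series = pd.Series(
    ["First document text.", "Second document text."],
    index=["doc-001", "doc-002"],
)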
After initializing the Language class, I register the extension with Doc.set_extension("idx", default=None). I then run nlp.pipe on my text and assign the index to each Doc via the idx extension:
def stream_text_series(series):
    # Pair each text with its Series index so it can be recovered after nlp.pipe
    data = ((text, {"idx": str(idx)}) for idx, text in series.items())
    for doc, context in nlp.pipe(data, as_tuples=True):
        # Store the original index on the Doc via the custom extension
        doc._.idx = context["idx"]
        yield doc
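For completeness, the surrounding setup looks roughly like this (a minimal sketch: the module-level nlp variable and the registration guard are simplifications here; the model is the one listed under "Further details"):

import spacy
from spacy.tokens import Doc, DocBin

# Register the custom extension once, before any Docs are created
if not Doc.has_extension("idx"):
    Doc.set_extension("idx", default=None)

# Load the transformer pipeline (see "Further details")
nlp = spacy.load("en_core_web_trf")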
When saving my data as a DocBin, I create the DocBin with store_user_data=True so that my extension is saved:
def convert_text_series_to_docs_and_serialize(series):
    # store_user_data=True keeps user_data (including extension values) in the DocBin
    doc_bin = DocBin(store_user_data=True)
    for doc in stream_text_series(series):
        doc_bin.add(doc)
    return doc_bin.to_bytes()
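For reference, I check the round trip roughly like this (a sketch; load_docs is just an illustrative helper name, and the Docs are restored against the vocab of the same nlp object):

def load_docs(bytes_data):
    # Rebuild the DocBin and restore the Docs against the pipeline's vocab
    doc_bin = DocBin().from_bytes(bytes_data)
    for doc in doc_bin.get_docs(nlp.vocab):
        # The idx extension should round-trip when store_user_data=True
        yield doc._.idx, doc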
Question: Am I implementing the extension feature incorrectly? Any thoughts on how I might proceed? Any suggestions are more than welcome!
Further details
- Used language model: “en_core_web_trf”.
- Memory usage: when I serialize my data without using extensions, my system uses about 3GB of RAM. With the extension, it uses all my available RAM (about 26GB).
- I ran the code in a fresh conda environment, using the installation instructions on the spaCy website.
- The problem occurs whether I use the CPU or the GPU.