My script, lemmatize.py, looks like this:
import json
import stanza


def lemmatize_text(language: str, text: str):
    """
    Lemmatizes the input text for a specified language and returns a map of original words to lemmatized words.

    :param language: The language code (e.g., 'en' for English).
    :param text: The input text to be lemmatized.
    :return: A dictionary mapping original words to lemmatized words.
    """
    # Initialize the pipeline for the specified language
    nlp = stanza.Pipeline(lang=language, processors='tokenize,pos,lemma', use_gpu=False)

    # Process the text
    doc = nlp(text)

    # Create a dictionary to map original words to their lemmatized forms
    word_to_lemma = {}
    for sentence in doc.sentences:
        for word in sentence.words:
            word_to_lemma[word.text] = word.lemma

    return word_to_lemma


def lemmatize_words_from_json(language: str):
    """
    Lemmatizes each word from the JSON file created by the `write_top_words_with_frequencies_to_json` function.

    :param language: The language code (e.g., 'en' for English); also determines the input and output file paths.
    """
    input_file = f'import/language/frequency/data/{language}.json'
    output_file = f'import/language/frequency/lemmatized/{language}.json'
    try:
        # Read the JSON file
        with open(input_file, mode='r', encoding='utf-8') as file:
            words_data = json.load(file)

        # Extract words from the JSON data
        words = [entry['word'] for entry in words_data]

        # Join the words into a single text for processing
        text = ' '.join(words)

        # Lemmatize the words
        print(f'Lemmatizing words for {language}...')
        word_to_lemma = lemmatize_text(language, text)

        # Write the mapping of original words to lemmatized words to a new JSON file
        with open(output_file, mode='w', encoding='utf-8') as file:
            json.dump(word_to_lemma, file, ensure_ascii=False, indent=2)

        print(f'Lemmatized words written to {output_file}')
    except Exception as e:
        print(f'Error lemmatizing {language} ({input_file} -> {output_file}): {e}')
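# Generate a lemma map for every language that has a frequency file.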
lemmatize_words_from_json('ar')
lemmatize_words_from_json('bn')
lemmatize_words_from_json('bg')
lemmatize_words_from_json('ca')
lemmatize_words_from_json('zh')
lemmatize_words_from_json('cs')
lemmatize_words_from_json('da')
lemmatize_words_from_json('nl')
lemmatize_words_from_json('en')
lemmatize_words_from_json('fi')
lemmatize_words_from_json('fr')
lemmatize_words_from_json('de')
lemmatize_words_from_json('el')
lemmatize_words_from_json('he')
lemmatize_words_from_json('hi')
lemmatize_words_from_json('hu')
lemmatize_words_from_json('is')
lemmatize_words_from_json('id')
lemmatize_words_from_json('it')
lemmatize_words_from_json('ja')
lemmatize_words_from_json('ko')
lemmatize_words_from_json('lv')
lemmatize_words_from_json('lt')
lemmatize_words_from_json('mk')
lemmatize_words_from_json('ms')
lemmatize_words_from_json('nb')
lemmatize_words_from_json('fa')
lemmatize_words_from_json('pl')
lemmatize_words_from_json('pt')
lemmatize_words_from_json('ro')
lemmatize_words_from_json('ru')
lemmatize_words_from_json('sk')
lemmatize_words_from_json('sl')
lemmatize_words_from_json('sh')
lemmatize_words_from_json('es')
lemmatize_words_from_json('sv')
lemmatize_words_from_json('fil')
lemmatize_words_from_json('ta')
lemmatize_words_from_json('tr')
lemmatize_words_from_json('uk')
lemmatize_words_from_json('ur')
lemmatize_words_from_json('vi')
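For context, each per-language input file under import/language/frequency/data/ was produced earlier by write_top_words_with_frequencies_to_json, so json.load returns a list of word/frequency entries. Roughly this shape (the name of the count field here is just my illustration; the script only relies on the 'word' key):

words_data = [
    {'word': 'the', 'frequency': 123456},
    {'word': 'of', 'frequency': 98765},
    # ...more entries per language
]
words = [entry['word'] for entry in words_data]  # -> ['the', 'of', ...]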
It just hangs and never even gets to logging "Lemmatizing words for ..." for the next language. Here are the logs:
$ python3 import/language/frequency/lemmatize.py
Lemmatizing words for ar...
2024-08-03 20:43:14 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 386kB [00:00, 27.9MB/s]
2024-08-03 20:43:14 INFO: Downloaded file to /Users/me/stanza_resources/resources.json
2024-08-03 20:43:14 WARNING: Language ar package default expects mwt, which has been added
2024-08-03 20:43:14 INFO: Loading these models for language: ar (Arabic):
=============================
| Processor | Package |
-----------------------------
| tokenize | padt |
| mwt | padt |
| pos | padt_charlm |
| lemma | padt_nocharlm |
=============================
2024-08-03 20:43:14 INFO: Using device: cpu
2024-08-03 20:43:14 INFO: Loading: tokenize
/opt/miniconda3/lib/python3.12/site-packages/stanza/models/tokenization/trainer.py:82: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-08-03 20:43:15 INFO: Loading: mwt
/opt/miniconda3/lib/python3.12/site-packages/stanza/models/mwt/trainer.py:170: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-08-03 20:43:15 INFO: Loading: pos
/opt/miniconda3/lib/python3.12/site-packages/stanza/models/pos/trainer.py:139: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(filename, lambda storage, loc: storage)
/opt/miniconda3/lib/python3.12/site-packages/stanza/models/common/pretrain.py:56: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
data = torch.load(self.filename, lambda storage, loc: storage)
/opt/miniconda3/lib/python3.12/site-packages/stanza/models/common/char_model.py:271: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(filename, lambda storage, loc: storage)
2024-08-03 20:43:15 INFO: Loading: lemma
/opt/miniconda3/lib/python3.12/site-packages/stanza/models/lemma/trainer.py:236: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-08-03 20:43:15 INFO: Done loading processors!
[HANGING]...
What is happening? How do I get this to work? It just hangs there. (It doesn't actually print [HANGING]...; that's just where the output stops.)
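For reference, the output I'm after per language is just the flat word-to-lemma map that lemmatize_text returns, dumped to import/language/frequency/lemmatized/{language}.json. A hypothetical English example of the shape I expect (the exact lemmas naturally depend on the Stanza model):

word_to_lemma = lemmatize_text('en', 'the dogs ran')
# expecting something roughly like {'the': 'the', 'dogs': 'dog', 'ran': 'run'}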