(The complete code is a bit lengthy, so it’s attached at the end of the post.)
When performing LDA topic extraction with Gensim, my program failed at import time with the following error, even though my own code never references the triu function:
File "~/topicmodel.py", line 1, in <module>
from gensim import corpora, models
File "~/miniconda3/envs/lda_td/lib/python3.12/site-packages/gensim/__init__.py", line 11, in <module>
from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils # noqa:F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/miniconda3/envs/lda_td/lib/python3.12/site-packages/gensim/corpora/__init__.py", line 6, in <module>
from .indexedcorpus import IndexedCorpus # noqa:F401 must appear before the other classes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/miniconda3/envs/lda_td/lib/python3.12/site-packages/gensim/corpora/indexedcorpus.py", line 14, in <module>
from gensim import interfaces, utils
File "~/miniconda3/envs/lda_td/lib/python3.12/site-packages/gensim/interfaces.py", line 19, in <module>
from gensim import utils, matutils
File "~/miniconda3/envs/lda_td/lib/python3.12/site-packages/gensim/matutils.py", line 20, in <module>
from scipy.linalg import get_blas_funcs, triu
ImportError: cannot import name 'triu' from 'scipy.linalg' (~/miniconda3/envs/lda_td/lib/python3.12/site-packages/scipy/linalg/__init__.py)
My environment:
- Operating System: Ubuntu 22.04
- Python Version: 3.12.3
- Gensim Version: 4.3.2
- SciPy Version: 1.13.0
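To isolate the failure from Gensim, I ran a quick check of whether scipy.linalg still exports triu in this environment. The numpy fallback below is only there so the check itself runs; numpy.triu computes the same upper-triangular matrix:

```python
import numpy as np

# Check whether scipy.linalg still exports triu in this environment;
# fall back to numpy.triu, which computes the same upper-triangular matrix.
try:
    from scipy.linalg import triu
    source = "scipy.linalg"
except ImportError:
    from numpy import triu
    source = "numpy"

a = np.arange(9).reshape(3, 3)
print(source)
print(triu(a).tolist())
```

With SciPy 1.13.0 installed, this falls through to the numpy branch, which matches the traceback above.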
Here’s my code:
from gensim import corpora, models  # Topic-modeling classes
import re  # Regular expressions for text cleaning
import nltk  # Natural Language Toolkit for tokenization
from nltk.corpus import stopwords  # Stopword lists
from nltk.stem import WordNetLemmatizer  # Lemmatizer for word normalization
from collections import Counter  # Word-frequency counting

def preprocess_text(text):
    """
    Cleans, tokenizes, removes stopwords, lemmatizes, and filters
    low-frequency words from raw text.
    """
    # Cleaning
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # Remove non-letter characters
    text = text.lower()  # Convert to lowercase
    # Tokenization
    nltk.download('punkt')  # Download the Punkt tokenizer (if not already present)
    tokens = nltk.word_tokenize(text)  # Split text into tokens
    # Stopword removal
    nltk.download('stopwords')  # Download the stopword list (if not already present)
    stop_words = set(stopwords.words('english'))  # English stopwords
    stop_words.update({"x", "p"})  # Add custom stopwords if needed
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Lemmatization
    nltk.download('wordnet')  # Download WordNet data (if not already present)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Low-frequency word removal
    word_counts = Counter(lemmatized_tokens)  # Count word frequencies
    min_count = 2  # Minimum word-frequency threshold
    return [token for token in lemmatized_tokens if word_counts[token] >= min_count]

def extract_topics(filenames, num_topics=20):
    """
    Extracts topics from multiple text files using LDA topic modeling.
    """
    processed_corpus = []
    for filename in filenames:
        with open(filename, "r", encoding="utf-8") as f:
            text = f.read()
        processed_corpus.append(preprocess_text(text))  # Preprocess and collect
    # Create dictionary and bag-of-words corpus
    dictionary = corpora.Dictionary(processed_corpus)
    corpus = [dictionary.doc2bow(text) for text in processed_corpus]
    # Train the LDA model
    lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    # Print the top keywords of each topic
    print(lda_model.print_topics())

# Example usage
filenames = ["eco1.txt", "eco2.txt", "eco3.txt"]
extract_topics(filenames)
Here’s what I have tried so far:
- Upgraded SciPy to the latest version.
- Reinstalled both SciPy and Gensim.
- Checked my code for circular imports or naming conflicts.
- Created a new virtual environment and reinstalled all dependencies.
However, the problem persists. The traceback shows the failing import is inside Gensim itself (gensim/matutils.py does "from scipy.linalg import get_blas_funcs, triu"), so I suspect an incompatibility between Gensim 4.3.2 and SciPy 1.13.0 rather than anything in my code.
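As a temporary experiment (a workaround sketch, not a fix I have verified against all of Gensim's functionality), aliasing NumPy's triu into scipy.linalg before anything imports Gensim makes the import resolve, since numpy.triu returns the same upper-triangular matrix that scipy.linalg.triu used to:

```python
import numpy as np
import scipy.linalg

# Hypothetical shim: restore the name that gensim 4.3.2 expects.
# This must run before `from gensim import corpora, models`.
if not hasattr(scipy.linalg, "triu"):
    scipy.linalg.triu = np.triu

from scipy.linalg import triu  # now resolves regardless of SciPy version
print(triu(np.ones((2, 2))).tolist())
```

I would still prefer to understand the root cause rather than patch a third-party module at runtime.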
Has anyone encountered a similar issue? Are there any solutions or troubleshooting tips you could suggest?