I am computing IDF values, and my Python code runs much slower than the PySpark implementation (2+ hours for mine versus seconds). I am interested in why that is. I know PySpark runs on the JVM, but the difference seems to be more than Python vs. Java alone can explain. I'm using a simple function like so:
import math
from tqdm import tqdm

def calc_idf(data, terms):
    # data is a list of lists of tokenized documents
    # terms is a list of the tokens to calculate IDF values for
    num_docs = len(data)
    idf_values = []
    for term in tqdm(terms, desc="IDF", position=0, leave=True):
        idf_val = 0
        for doc in data:  # linear scan over every document for every term
            if term in doc:
                idf_val += 1
        # Using base 2, as the original paper did
        idf_values.append(math.log2((num_docs + 1) / (idf_val + 1)))
    return idf_values
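For a sense of the shapes involved, a toy call looks like this (the data here is made up; my real corpus and vocabulary are far larger):

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "cat", "ran"],
]
vocab = ["the", "cat", "dog"]

print(calc_idf(docs, vocab))
# "the" appears in 2 of 3 docs -> log2((3+1)/(2+1)) = log2(4/3) ≈ 0.415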
The IDF definition I am using is from this documentation (https://spark.apache.org/docs/3.5.1/api/python/reference/api/pyspark.mllib.feature.IDF.html). I don't think the rest of my implementation is relevant, but it can be found in this question (Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors); just know that it is significantly slower.
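For reference, the PySpark side of my comparison follows the example in those docs fairly closely; the sketch below is a simplification (the input path and the minDocFreq value are placeholders, not my actual pipeline):

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext()

# Load and tokenize documents; "data.txt" is a placeholder path.
documents = sc.textFile("data.txt").map(lambda line: line.split(" "))

# Term frequencies via the hashing trick, then fit IDF over the corpus.
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
tf.cache()
idf = IDF(minDocFreq=0)  # placeholder threshold
model = idf.fit(tf)
tfidf = model.transform(tf)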
Could somebody please advise on how I could improve the speed of my IDF calculation?
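For what it's worth, my understanding of the cost is that the nested loops make this O(len(terms) × len(data) × average document length), since the membership test "term in doc" is a linear scan over a list. One direction I was considering (a rough, untested sketch) is a single pass over the corpus that counts document frequencies up front:

import math
from collections import Counter

def calc_idf_fast(data, terms):
    # One pass over the corpus: set(doc) drops within-document
    # repeats, so each document contributes at most 1 per token.
    doc_freq = Counter()
    for doc in data:
        doc_freq.update(set(doc))
    num_docs = len(data)
    # Same smoothed, base-2 formula as above; unseen terms get DF = 0.
    return [math.log2((num_docs + 1) / (doc_freq[term] + 1)) for term in terms]

The idea is to replace the per-term scans with dictionary lookups, but I don't know if that alone accounts for the gap to PySpark.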