I’m working on a PySpark script to calculate similarity scores between columns using FuzzyWuzzy. I’ve defined a UDF for this purpose and used a for loop to iterate over columns specified in a metadata table, storing the similarity scores in the same DataFrame.
Here’s the UDF I defined:
similarity_udf = F.udf(lambda x, y: fuzz.ratio(x, y), IntegerType())
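For context, fuzz.ratio is roughly equivalent to difflib's SequenceMatcher.ratio() from the standard library, scaled to an integer 0–100. A dependency-free stand-in (useful for ruling out the FuzzyWuzzy dependency itself when debugging the worker error; the name ratio here is just illustrative):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    """Approximate fuzz.ratio: string similarity as an integer 0-100."""
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))
```

Swapping this in for fuzz.ratio inside the UDF keeps the rest of the pipeline unchanged.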
I have a metadata table that lists the column pairs for which I need to compute similarity scores. Here’s the simplified code:
meta_data = [
    ('column1', 'column2'),
    ('column3', 'column4'),
    # More column pairs
]
for col1, col2 in meta_data:
    df = df.withColumn(f'{col1}_{col2}_similarity', similarity_udf(F.col(col1), F.col(col2)))
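Since the similarity columns created by the loop above all follow the f'{col1}_{col2}_similarity' pattern, the "greater than 95 in every column" condition can be built programmatically. A sketch of the plain-Python part (in PySpark, the resulting string would be passed to df.filter(), which accepts a SQL expression):

```python
meta_data = [
    ('column1', 'column2'),
    ('column3', 'column4'),
]

# Column names produced by the withColumn loop above
sim_col_names = [f'{c1}_{c2}_similarity' for c1, c2 in meta_data]

# SQL expression requiring every similarity score to exceed 95;
# usable as df.filter(condition) in PySpark
condition = ' AND '.join(f'{name} > 95' for name in sim_col_names)
```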
After calculating the similarity scores, I filter the DataFrame to keep only rows where every similarity column exceeds 95. However, when I call show() or count(), or try to load the data into Snowflake, I get a timeout error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "C:\Python\Lib\socket.py", line 709, in readinto
raise
TimeoutError: timed out
- There are only 8 records in the DataFrame.
- I tried changing the Spark session configuration, but it didn’t resolve the issue.
Technologies and versions:
- Python 3.11.6
- Java version “1.8.0_371”
- Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
- Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)
- PySpark version 3.5.0
- Using Scala version 2.12.18, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_371
Questions:
- Is my approach of using a for loop to iterate over column pairs and apply the UDF correct in PySpark?
- Why might the show() and count() operations cause a timeout even with such a small dataset?
- Are there any best practices or alternative methods for calculating and filtering similarity scores across multiple column pairs in PySpark?
- How can I debug or resolve this timeout issue to successfully load data into Snowflake?