PySpark FuzzyWuzzy UDF causes TimeoutError when filtering on similarity scores, even on a small dataset

I’m working on a PySpark script to calculate similarity scores between columns using FuzzyWuzzy. I’ve defined a UDF for this purpose and used a for loop to iterate over columns specified in a metadata table, storing the similarity scores in the same DataFrame.
Here’s the UDF I defined:
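# Assumes: from fuzzywuzzy import fuzz; from pyspark.sql import functions as F; from pyspark.sql.types import IntegerType
similarity_udf = F.udf(lambda x, y: fuzz.ratio(x, y), IntegerType())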
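For readers who want something self-contained to experiment with, here is a minimal sketch of the setup described above: a scoring function wrapped in a UDF, then a loop that adds one score column per column pair. The names `similarity`, `add_similarity_columns`, `column_pairs`, and the `<left>_<right>_score` column naming are hypothetical and not from the original script. Returning None for missing inputs and falling back to `difflib` when `fuzzywuzzy` is not installed are also assumptions made only so the sketch runs on its own.

```python
from difflib import SequenceMatcher

try:
    from fuzzywuzzy import fuzz  # the library used in the question
except ImportError:
    fuzz = None  # assumption: fall back to difflib so the sketch still runs


def similarity(x, y):
    """Return a 0-100 similarity score, or None if either input is missing."""
    # Assumption: treat a missing value as "no score" rather than scoring it.
    if x is None or y is None:
        return None
    if fuzz is not None:
        return int(fuzz.ratio(x, y))
    # Rough stand-in for fuzz.ratio, based on the standard library only.
    return int(round(100 * SequenceMatcher(None, x, y).ratio()))


def add_similarity_columns(df, column_pairs):
    """Append one `<left>_<right>_score` column per pair (hypothetical naming)."""
    # Imported lazily so the pure-Python scoring function works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    similarity_udf = F.udf(similarity, IntegerType())
    for left, right in column_pairs:
        df = df.withColumn(
            f"{left}_{right}_score",
            similarity_udf(F.col(left), F.col(right)),
        )
    return df
```

With a DataFrame `df` that has string columns named, say, `name_a` and `name_b` (hypothetical names), calling `add_similarity_columns(df, [("name_a", "name_b")])` would return a new DataFrame with an extra `name_a_name_b_score` column. In the setup described in the question, the list of pairs would come from the metadata table.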