PySpark socket timeout error when using .collect() or .count()
I have been working with PySpark and distributed computing to process dataframes that involve querying PI. I have been using user-defined functions (UDFs) and have a semi-working version of the code. However, I keep getting a socket timeout error, and I cannot trace its root cause because it triggers at different points in the program and at different times during execution. On top of that, the program occasionally completes successfully with no errors thrown at all.
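
A minimal sketch of the pattern I am describing is below; the tag names, the input dataframe, and the PI call itself are placeholders, not my actual code, but the structure (a Python UDF that queries PI per row, followed by `.count()` / `.collect()`) is the same:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pi-udf-timeout-example").getOrCreate()

def query_pi(tag_name):
    """Stand-in for the per-row PI query; the real code calls PI here."""
    # e.g. something like:
    #   resp = requests.get(PI_WEB_API_URL + tag_name, timeout=30)
    #   return float(resp.json()["Value"])
    return 0.0  # placeholder so the sketch runs without a PI server

query_pi_udf = udf(query_pi, DoubleType())

# Placeholder input; the real dataframe is much larger
df = spark.createDataFrame([("TAG_A",), ("TAG_B",)], ["tag"])
result = df.withColumn("value", query_pi_udf(df["tag"]))

result.count()           # the socket timeout sometimes fires here...
rows = result.collect()  # ...sometimes here, and sometimes not at all
```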