
Tag Archive for apache-spark, pyspark

pyspark: Python worker exited unexpectedly (crashed)

I just installed Spark on my local machine and tried running some simple code to check that everything works, but it keeps failing with: Python worker exited unexpectedly (crashed). This is the code I’m running:
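Since the failing code isn’t shown, the cause here is only an assumption, but a very common reason for this crash on a local install is a mismatch between the Python interpreters used by the driver and the workers. Pinning both sides to the same interpreter before the session starts often resolves it:

```python
import os

# Assumption: the crash stems from a driver/worker Python mismatch.
# Point both sides at the same interpreter *before* creating the
# SparkSession, so the JVM launches workers with this exact Python.
os.environ["PYSPARK_PYTHON"] = "python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"
```

The same effect can be achieved by exporting these variables in the shell before launching `pyspark` or `spark-submit`.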

why might pyspark code take so long even when only running lazy transformations?

I have a block of PySpark code, about 40 lines, containing only lazy transformations. I’ve painstakingly removed every action method, since I need to delay execution until after this block of code. However, when benchmarked, the block still takes 15 minutes! Is there a reason why that might be the case? Could execution be triggered if the DAG gets too large? Is there a way to confirm whether actions are being taken?
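One way to answer the last question is a rough static scan of the block’s source for well-known Spark action methods. This is only a sketch: the action list below is not exhaustive, and a textual scan cannot tell a DataFrame `count()` (an action) apart from a `GroupedData.count()` (a transformation). Also worth noting: some seemingly lazy steps do launch jobs, e.g. reading files with `inferSchema=True`.

```python
import re

# Common Spark action method names (a partial list, for illustration).
ACTIONS = {
    "collect", "count", "show", "take", "first", "head",
    "toPandas", "foreach", "foreachPartition", "reduce",
    "save", "saveAsTable",
}

def find_actions(source: str):
    """Return the Spark action method names referenced in `source`.

    Scans for `.name(` call patterns and intersects with ACTIONS;
    a heuristic, not a parser.
    """
    calls = re.findall(r"\.(\w+)\s*\(", source)
    return sorted(set(calls) & ACTIONS)
```

Calling `df.explain()` on the final frame is also safe: it prints the physical plan without triggering execution, and the Spark UI’s Jobs tab will show whether any job actually ran while the block executed.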

Apache Spark / PySpark: MERGE clause to support deleted rows

The following MERGE clause supports updates and newly inserted rows. However, if a row is deleted from the source, it still appears in the target. Can someone help modify the MERGE clause so that deleted rows no longer appear?
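Assuming the target is a Delta Lake table on a recent Delta version (roughly 2.3+), MERGE supports a `WHEN NOT MATCHED BY SOURCE THEN DELETE` clause that removes target rows no longer present in the source. A sketch that builds such a statement, with hypothetical table and column names:

```python
def build_merge_sql(target: str, source: str, key: str, cols: list) -> str:
    """Build a Delta MERGE statement that updates, inserts, and also
    deletes target rows that no longer exist in the source."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    insert_cols = ", ".join(cols)
    insert_vals = ", ".join(f"s.{c}" for c in cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})\n"
        f"WHEN NOT MATCHED BY SOURCE THEN DELETE"
    )
```

Running `spark.sql(build_merge_sql("target_tbl", "source_tbl", "id", ["id", "name"]))` (table and column names are assumptions) would then delete target rows whose keys no longer exist in `source_tbl`.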

Setting `SPARK_USER` breaks S3 SDF writes

I am running Spark jobs on Kubernetes. Usernames are fixed for all users, so I attempted to set the SPARK_USER environment variable so that I could distinguish my jobs from others’ on the Spark History Server. However, I found that setting this env var breaks writing to S3.
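One workaround, sketched under the assumption that the only goal is telling jobs apart in the History Server: leave SPARK_USER alone (it can also change the Hadoop-side user identity that the S3A connector sees) and label jobs with a distinctive app name instead, which the History Server also displays. The app name and master URL below are hypothetical placeholders:

```shell
# Sketch: tag the job via its app name rather than overriding SPARK_USER.
spark-submit \
  --conf spark.app.name="team-a-nightly-etl" \
  --master k8s://https://<api-server> \
  ...
```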