Problem statement: I want to utilise both the driver and worker nodes of my Databricks cluster. I submit my code as a UDF so that it runs on the worker nodes, but I do not know how to control or tune it further. My code is IO-intensive, so I want to experiment with concurrency at the level of each worker's cores. I tried to find material on the internals, such as the architecture of driver and worker node execution/processing, but there are not enough resources. Can someone please help?
Also, I tried to use a print statement and a logger, but I see the UDF is not printing anything. Is there a way to print logs? I tried the option below:

log4jLogger = spark._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)
logger.setLevel(log4jLogger.Level.INFO)

It only prints driver logs.
I hope your UDF isn’t doing any heavy lifting. UDFs are designed to be lightweight, and each invocation typically runs on a single executor core. You can refer to this link to see whether your task fits that description.
You can’t directly control concurrency within a UDF. Instead, Spark manages parallelism by distributing data partitions across the available cores in the cluster.
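To see what that means in practice, here is a minimal sketch: the lever you have is the partition count, not anything inside the UDF. The DataFrame, the url column, the partition count of 64, and the fetch_status UDF are all hypothetical, just to illustrate an IO-bound call:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
import urllib.request

spark = SparkSession.builder.getOrCreate()

@udf(StringType())
def fetch_status(url):
    # Hypothetical IO-bound work; each invocation runs inside one task,
    # and each task occupies one executor core.
    try:
        return str(urllib.request.urlopen(url, timeout=5).status)
    except Exception as e:
        return f"error: {e}"

df = spark.createDataFrame([("https://example.com",)], ["url"])

# More partitions -> more tasks eligible to run at the same time,
# capped by the total number of executor cores in the cluster.
result = df.repartition(64).withColumn("status", fetch_status(col("url")))

So for an IO-heavy workload, you would raise the partition count (and/or the cluster's core count) rather than trying to spawn threads inside the UDF.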
To answer your second question, UDF logs are typically written to the executor logs, which you can access through the Spark UI: check the stdout and stderr logs under the Executors tab.
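For example, a plain print() inside a UDF lands in the executor's stdout rather than in the notebook output. A minimal sketch (the shout UDF is hypothetical):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def shout(value):
    # This runs on an executor, so the output shows up in that executor's
    # stdout log (Spark UI -> Executors tab), not in the driver's output.
    print(f"processing: {value}")
    return value.upper()

Your log4j snippet only configures the driver's JVM logger, which is why you only see driver logs; on the executors, writing to stdout/stderr (or configuring a Python logger inside the UDF itself) is how output reaches those logs.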