I've read that the Spark session's context is thread-safe, but not in all cases.
I have a multi-threaded application organized as follows:
N worker threads serving an event bus and submitting some simple Spark jobs.
Now I want to attach a job description to each task to make the pipeline more observable, using something like this:
import org.apache.spark.sql.SparkSession

def withJobDescription[T](desc: String)(fn: => T)(implicit spark: SparkSession): T = {
  spark.sparkContext.setJobDescription(desc)
  try {
    fn
  } finally {
    spark.sparkContext.setJobDescription("")
  }
}
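For context, each worker calls it roughly like this (a simplified sketch; the thread pool, the names, and the dummy action are made up, only the withJobDescription call matters, and the helper above is assumed to be in scope):

import java.util.concurrent.Executors
import org.apache.spark.sql.SparkSession

object Workers {
  implicit val spark: SparkSession = SparkSession.builder()
    .appName("event-bus-workers")
    .getOrCreate()

  // N worker threads, each handling events from the bus and running a small Spark job
  private val pool = Executors.newFixedThreadPool(4)

  def handle(eventId: String): Unit = {
    pool.submit(new Runnable {
      override def run(): Unit =
        withJobDescription(s"Processing event $eventId") {  // helper defined above
          spark.range(100).count()                          // some simple Spark action
        }
    })
  }
}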
sparkContext looks like global storage that must not be mutated from several threads at once. On the other hand, I see parts like this in the official codebase:
org/apache/spark/util/HadoopFSUtils.scala
val previousJobDescription = sc.getLocalProperty(SparkContext.SPARK_JOB_DESCRIPTION)
val statusMap = try {
  val description = paths.size match {
    case 0 =>
      "Listing leaf files and directories 0 paths"
    case 1 =>
      s"Listing leaf files and directories for 1 path:<br/>${paths(0)}"
    case s =>
      s"Listing leaf files and directories for $s paths:<br/>${paths(0)}, ..."
  }
  sc.setJobDescription(description)
  ....
} finally {
  sc.setJobDescription(previousJobDescription)
}
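If I mimic that pattern, my helper would presumably save and restore the previous description instead of resetting it to an empty string. Here is my adaptation (just a sketch; as far as I can tell the SPARK_JOB_DESCRIPTION constant is private[spark], so I use the literal property key instead):

import org.apache.spark.sql.SparkSession

def withJobDescription[T](desc: String)(fn: => T)(implicit spark: SparkSession): T = {
  val sc = spark.sparkContext
  // remember whatever description was already set for the current thread
  val previous = sc.getLocalProperty("spark.job.description")
  sc.setJobDescription(desc)
  try {
    fn
  } finally {
    // restore the previous value instead of clearing it, like HadoopFSUtils does
    sc.setJobDescription(previous)
  }
}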
So my question is: what is the canonical way of setting the job description in a multi-threaded application, and what is the proper way of using Spark sessions in such an environment?