PySpark version: 3.3.0-amzn-0
Python: 3.7.16
I am using the code snippet below, where I try to pass a Spark session as a parameter to a function named test that is submitted through ProcessPoolExecutor.
from concurrent.futures import ProcessPoolExecutor, as_completed
from time import sleep
from pyspark.sql import SparkSession
def test(spark):
    spark.sql("select 1").show()
    pass

if __name__ == '__main__':
    future_bag = {}
    spark = SparkSession.builder.appName("vamsi_test").enableHiveSupport().getOrCreate()
    with ProcessPoolExecutor(2) as exe:
        for i in range(1, 3):
            if i == 7: sleep(2)
            future = exe.submit(test, spark)
            future_bag[future] = "job_{}".format(i)
        for future in as_completed(future_bag):
            job = future_bag[future]
            try:
                future.result()
            except Exception as exp:
                print("Exception is:", exp)
When I execute the above code, I get the following error/exception:
Exception is: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
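If my understanding is correct, the failure happens before test ever runs on a worker: ProcessPoolExecutor pickles every argument passed to submit in order to ship it to the child process, and SparkContext deliberately refuses to be serialized. The minimal repro below (my own sketch, not part of the original job) appears to raise the same exception:

import pickle
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vamsi_test").enableHiveSupport().getOrCreate()
# Raises the SPARK-5063 error: pickling the session descends into its
# SparkContext, which blocks serialization on purpose.
pickle.dumps(spark)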
Kindly help me fix this issue so that I can pass the Spark session to a function executed through ProcessPoolExecutor.
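For reference, here is a thread-based variant that I believe side-steps the problem, since threads share the driver process and the session is never pickled. I am including it only as a sketch, in case a ThreadPoolExecutor is acceptable for this workload (Spark actions are scheduled on the cluster anyway, so threads may give enough concurrency on the driver side):

from concurrent.futures import ThreadPoolExecutor, as_completed
from pyspark.sql import SparkSession

def test(spark):
    spark.sql("select 1").show()

if __name__ == '__main__':
    future_bag = {}
    spark = SparkSession.builder.appName("vamsi_test").enableHiveSupport().getOrCreate()
    # Threads run in the driver process, so the SparkSession can be shared
    # directly; no argument ever crosses a process boundary.
    with ThreadPoolExecutor(2) as exe:
        for i in range(1, 3):
            future = exe.submit(test, spark)
            future_bag[future] = "job_{}".format(i)
        for future in as_completed(future_bag):
            job = future_bag[future]
            try:
                future.result()
            except Exception as exp:
                print("Exception in", job, ":", exp)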