I’m trying to integrate Spark Connect (spark-connect) into my Spark jobs. To run the jobs in the background I use Celery with the eventlet pool at a concurrency of 5. This works fine against a locally set up Spark cluster with one master and one worker node:
spark = (
    SparkSession.builder
    .master(settings.SPARK_MASTER)
    .appName(self.__app_name)
    .getOrCreate()
)
This works fine inside Celery.
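For reference, the working setup looks roughly like the sketch below (module name, broker URL, master URL and app name are placeholders rather than my real config; the task name matches the traceback further down):

from celery import Celery
from pyspark.sql import SparkSession

app = Celery("jobs", broker="redis://localhost:6379/0")  # broker URL is a placeholder

@app.task(name="conn.union")
def union():
    # Classic SparkSession pointing at the standalone master
    spark = (
        SparkSession.builder
        .master("spark://master-host:7077")  # settings.SPARK_MASTER in my real code
        .appName("ifp-dataprep-job")         # self.__app_name in my real code
        .getOrCreate()
    )
    # ... job logic using `spark` ...
    spark.stop()

# The worker is started with the eventlet pool, e.g.:
#   celery -A jobs worker -P eventlet -c 5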
I tried to integrate Spark Connect into the existing jobs:
spark = SparkSession.builder.remote("sc://xx.x.xxx.71:15002").getOrCreate()
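Inside the Celery task this is essentially a drop-in replacement for the builder above; roughly (same placeholder names as in the earlier sketch, host and port redacted):

from celery import Celery
from pyspark.sql import SparkSession

app = Celery("jobs", broker="redis://localhost:6379/0")  # broker URL is a placeholder

@app.task(name="conn.union")
def union():
    # Spark Connect session instead of a classic master URL; host/port are placeholders
    spark = SparkSession.builder.remote("sc://xx.x.xxx.71:15002").getOrCreate()
    # ... same job logic as before ...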
This fails when run inside Celery with the eventlet pool, but works with the solo pool. I enabled debug logging for PySpark and found a partial-import error that is failing my jobs:
ERROR/MainProcess] Task conn.union[afd002b6-39ca-4c51-b017-2ffd06f51c35] raised unexpected: ImportError("cannot import name 'SparkSession' from partially initialized module 'pyspark.sql.connect.session' (most likely due to a circular import) (C:\SCAD\ifp-dataprep\tenv\Lib\site-packages\pyspark\sql\connect\session.py)")
Traceback (most recent call last):
File "C:\SCAD\ifp-dataprep\tenv\Lib\site-packages\celery\app\trace.py", line 477, in trace_task
R = retval = fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "C:\SCAD\ifp-dataprep\tenv\Lib\site-packages\celery\app\trace.py", line 760, in __protected_call__
return self.run(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\SCAD\ifp-dataprep\django\conn.py", line 26, in union
spark = SparkSession.builder.remote("sc://3.110.235.163:15002").create()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\SCAD\ifp-dataprep\tenv\Lib\site-packages\pyspark\sql\session.py", line 524, in create
from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
ImportError: cannot import name 'SparkSession' from partially initialized module 'pyspark.sql.connect.session' (most likely due to a circular import) (C:\SCAD\ifp-dataprep\tenv\Lib\site-packages\pyspark\sql\connect\session.py)
What is causing this issue? Is it due to the eventlet concurrency? How can I fix it?