The write step fails when it gets to the write command; everything before that works: the pandas DataFrame is created and the transformations apply fine. The error I keep getting when trying to write is:

[PYTHON_VERSION_MISMATCH] Python in worker has different version (3, 11) than that in driver 3.9, PySpark cannot run with different minor versions

The Python version in my editor is 3.9, and my pandas and PySpark versions should be compatible, so I don't know where the issue is coming from.

I've tried setting the driver Python to 3.9 in the SparkSession builder, and I've exported the driver environment variables to 3.9, but neither has solved the issue. My versions are Python 3.9, pandas 1.5.3, and PySpark 3.5.0. I also tried running everything under Python 3.11, but that wouldn't run because PySpark isn't supported there and won't import.
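For reference, this is roughly how I exported the environment variables before creating the session (/usr/local/bin/python3.9 is where my 3.9 interpreter lives):

import os

# Pin both the driver and the workers to the same 3.9 interpreter,
# set before any SparkSession is created.
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.9"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/bin/python3.9"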
import pandas
from pyspark.sql import SparkSession
from pyspark.sql.types import TimestampType

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Test Connection")
    .config("spark.jars", "/usr/local/bin/postgresql-42.7.3.jar")
    .config("spark.executorEnv.PYSPARK_PYTHON", "/usr/local/bin/python3.9")
    .config("spark.executorEnv.PYSPARK_DRIVER_PYTHON", "/usr/local/bin/python3.9")
    .config("spark.pyspark.python", "/usr/local/bin/python3.9")
    .config("spark.pyspark.driver.python", "/usr/local/bin/python3.9")
    .getOrCreate()
)
df = pandas.read_csv(data_file)
df["title"] = df["title"].apply(lambda title: decode_from_base64(title))

df_spark = spark.createDataFrame(df)
df_spark = df_spark.withColumn("snapshot_time(UTC)", df_spark["snapshot_time(UTC)"].cast(TimestampType()))
df_spark_schema = spark.createDataFrame(df_spark.rdd, schema=TABLE_SCHEMA)
df_spark_schema = df_spark_schema.withColumnRenamed("snapshot_time(UTC)", "snapshot_time_utc")
df_spark_schema.write.jdbc(url=f"jdbc:postgresql://{ip_addr}:{port}/{db}", table=main_table, properties=connection, mode="append")  # breaks here
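In case it matters, connection is just a plain dict of JDBC properties, along these lines (credentials below are placeholders):

connection = {
    "user": "my_user",          # placeholder
    "password": "my_password",  # placeholder
    "driver": "org.postgresql.Driver",
}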