There are several related posts about this error, but most have no answers, and the answered ones involve a different context from mine.
Question: What might be causing the following error, and how can it be fixed? The error occurs at the line df = spark.createDataFrame([.....])
Remarks:
- The error does not occur if I simply use df = spark.range(10) instead; that displays data successfully.
- Python version: 3.12, Spark: 3.3.4, PySpark: 3.3.4
- java version “1.8.0_401”, Java(TM) SE Runtime Environment (build 1.8.0_401-b10), Java HotSpot(TM) 64-Bit Server VM (build 25.401-b10, mixed mode)
- I'm using VSCode as an editor on Windows 10
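For anyone trying to reproduce this, the version details in the remarks above can be gathered from inside the same interpreter with a small script like the one below. This is only a sketch: `collect_versions` is a helper name made up for illustration, and the script assumes `java` is reachable on the PATH (it records None otherwise). A mismatch between the Python and PySpark versions is one thing worth ruling out, so it helps to confirm which interpreter and which PySpark the notebook kernel actually uses.

```python
import os
import platform
import subprocess
import sys


def collect_versions():
    """Gather interpreter, PySpark, and Java details for a bug report."""
    info = {
        "python": platform.python_version(),
        "executable": sys.executable,
        # Environment variables that can influence which Python, Java, or Spark is picked up.
        "PYSPARK_PYTHON": os.environ.get("PYSPARK_PYTHON"),
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "SPARK_HOME": os.environ.get("SPARK_HOME"),
    }

    try:
        import pyspark
        info["pyspark"] = pyspark.__version__
    except Exception:
        # PySpark is missing or fails to import under this interpreter.
        info["pyspark"] = None

    try:
        # `java -version` normally writes its banner to stderr, not stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, timeout=30
        )
        lines = (result.stderr or result.stdout).strip().splitlines()
        info["java"] = lines[0] if lines else None
    except (OSError, subprocess.TimeoutExpired):
        # `java` is not on the PATH or did not respond in time.
        info["java"] = None

    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Running it in the same VSCode kernel as the failing cell shows whether the values match what I listed above.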
Code:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import findspark
findspark.init()
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row
df = spark.createDataFrame([
Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0))
])
Error:
Py4JError Traceback (most recent call last)
Cell In[6], line 9
6 from pyspark.sql import Row
8 # df = spark.range(10)
----> 9 df = spark.createDataFrame([
10 Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0))
11 ])
File c:VSCode_PyProjectsPySpark_Official_proj.venvLibsite-packagespysparksqlsession.py:1443, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
1438 if has_pandas and isinstance(data, pd.DataFrame):
1439 # Create a DataFrame from pandas DataFrame.
1440 return super(SparkSession, self).createDataFrame( # type: ignore[call-overload]
1441 data, schema, samplingRatio, verifySchema
1442 )
-> 1443 return self._create_dataframe(
1444 data, schema, samplingRatio, verifySchema # type: ignore[arg-type]
1445 )
File c:VSCode_PyProjectsPySpark_Official_proj.venvLibsite-packagespysparksqlsession.py:1485, in SparkSession._create_dataframe(self, data, schema, samplingRatio, verifySchema)
1483 rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
1484 else:
-> 1485 rdd, struct = self._createFromLocal(map(prepare, data), schema)
1486 assert self._jvm is not None
...
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)