I have the following piece of code:
from pyspark.sql import functions as F

# Explode the two id columns into a single distinct column of vertex values
vertices = (
    df1.select(F.explode(F.array('issuer_id_indexed', 'receiver_id_indexed')))
    .distinct()
    .rdd.map(lambda row: row[0])  # Transform the exploded values into an RDD
)

# Create an index for each vertex
vertices_with_index = vertices.zipWithIndex().map(lambda x: (x[1], x[0]))  # Swap to (index, value)

# Convert the RDD to a DataFrame with column names 'id' and 'node'
vertex_df = vertices_with_index.toDF(["id", "node"])

# Show the resulting DataFrame
vertex_df.show()
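For context, this is roughly how I create the session and the input DataFrame locally (a minimal sketch; df1 here is just toy data standing in for my real DataFrame):

from pyspark.sql import SparkSession

# Local session; the master setting and app name are just my local setup
spark = SparkSession.builder.master("local[*]").appName("vertex-index").getOrCreate()

# Toy stand-in for my real df1 with the two indexed id columns
df1 = spark.createDataFrame(
    [(0.0, 1.0), (1.0, 2.0), (2.0, 0.0)],
    ["issuer_id_indexed", "receiver_id_indexed"],
)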
When I try to run the above code, I get the error/traceback below.
The error seems to occur when I call the toDF function on vertices_with_index in order to convert it from a pyspark.rdd.PipelinedRDD to a Spark DataFrame.
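One variant I considered, in case the failure comes from toDF's type inference, is passing an explicit schema through createDataFrame instead (a sketch; LongType/DoubleType are assumptions about what zipWithIndex and my id columns actually produce):

from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# Build the DataFrame with an explicit schema instead of letting
# toDF infer the column types from the first rows
schema = StructType([
    StructField("id", LongType(), False),     # zipWithIndex produces a long index
    StructField("node", DoubleType(), True),  # assumed type of my indexed ids
])
vertex_df = spark.createDataFrame(vertices_with_index, schema)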
I am currently trying to run PySpark on my local machine, and I have tried the steps below to resolve this error, with no luck:
- Ensured that my environment variables are correct (checked as in the snippet after this list)
- Downgraded to a lower Java version (Java 8)
- Increased the amount of resources given to Spark.
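For reference, this is how I checked the environment variables (which variables matter may vary; these are the ones I rely on locally):

import os

# Print the Spark-related environment variables to confirm they are set
for var in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON"):
    print(var, "=", os.environ.get(var))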
Any help will be very much appreciated.