I’m unsure how to handle an error I’m getting when converting a pandas DataFrame into a PySpark DataFrame. My pandas DataFrame has a column “array_output” whose values are NumPy arrays created from images using OpenCV. Each value looks something like this:
array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ...
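For context, this is roughly how the column is built (the names and paths here are illustrative, not my exact code):

import cv2
import pandas as pd

# Each cell of "array_output" holds a full 3-D image array of shape
# (height, width, 3), as returned by cv2.imread.
paths = ["img1.png", "img2.png"]
pdf = pd.DataFrame({
    "path": paths,
    "array_output": [cv2.imread(p) for p in paths],
})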
Converting the DataFrame to a PySpark DataFrame gives me the following error:
UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Can only convert 1-dimensional array values
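For reference, the conversion itself is just the standard call (spark is an existing SparkSession; pdf is the pandas DataFrame from above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# This is the line that triggers the Arrow warning above.
sdf = spark.createDataFrame(pdf)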
I’ve tried flattening the array with this:
.apply(lambda x: [item for sublist in x for item in sublist])
But that doesn’t seem to work either; I get the same error. I suspect the comprehension only removes one level of nesting, so each element is still a group of three channel values rather than a scalar, but I’m not sure how to fix it.
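In case it helps, this is roughly the full attempt (again with illustrative names):

# Flatten each image array by one level, then retry the conversion.
pdf["array_output"] = pdf["array_output"].apply(
    lambda x: [item for sublist in x for item in sublist]
)
sdf = spark.createDataFrame(pdf)  # still fails with the same message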
Any thoughts on how I can accomplish this would be appreciated. Thanks in advance.