The existing DataFrame is:
| header | body |
| --- | --- |
| xxx | '{"name":"john","age":20,"emails":["[email protected]","[email protected]"]}' |
| xxx | '{"name":"jerry","age":30,"emails":["[email protected]","[email protected]"]}' |
Its schema:
root
 |-- body: string (nullable = true)
 |-- header: string (nullable = true)
I want to extract the column 'body' and convert it from a string into a new DataFrame, as below:
| name | age | emails |
| --- | --- | --- |
| john | 20 | ["[email protected]","[email protected]"] |
| jerry | 30 | ["[email protected]","[email protected]"] |
I tried:
df1 = df.withColumn('body', sf.to_json(sf.col('body'), ArrayType(StringType)))
but got the error 'elementType <class 'pyspark.sql.types.StringType'> should be an instance of <class 'pyspark.sql.types.DataType'>'
I also tried df1 = df.select("body.*")
and got the error 'can only star expand struct data type, attribute arraybuffer(body)'
How can I convert the column 'body' from a string into a new DataFrame in an effective way?