I have this code:
<code>
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, explode, schema_of_json, lit

spark = SparkSession.builder.getOrCreate()

s = '{"job_id":"123","settings":{"task":[{"taskname":"task1","notebook_task":{"notebook_path":"path1"}},{"taskname":"task2","notebook_task":{"notebook_path":"path2"}}]}}'
schema = schema_of_json(lit(s))

result_df = (
    spark.createDataFrame([s], "string")
    .select(from_json(col("value"), schema).alias("data"))
    .select("data.job_id", explode("data.settings.task.taskname").alias("taskname"))
)
result_df.show()
</code>
Which generates this:
<code>
+------+--------+
|job_id|taskname|
+------+--------+
|   123|   task1|
|   123|   task2|
+------+--------+
</code>
How can I add the field ‘notebook_path’ to the DataFrame? It seems that explode can’t generate more than one field, and I can’t put two explode functions in the same select. I know I could create an additional DataFrame (job_id, notebook_path) and then join the two DataFrames on job_id. Is there a better solution?