I have a raw databricks table with a single colunm named “value” which the data is from a CSV with pipe delimited.
The data inside this single colunm is 5 fields.
The field3 content is a json which contains PIPE inside.
The data looks like:
10|20|{“menu”:{“id”:”file”,”menuitem”:[{“value”:”New”,”onclick”:”CreateNewDoc() || ReplaceDoc() || DeleteDoc()”}]}}|13123|344234
I’m new to pyspark and my data is massive, so i’ve tried to break the content to a new 5 colunms table, using regex in the field 3 with no sucess.
My code looks like:
from pyspark.sql.functions import split, col, regexp_extract
from pyspark.sql.functions import DataFrame
df = spark.sql(
"""
select value from my_table
"""
)
df_out = df.withColunm(field1, split(col("value", "\|".getItem(0))
.withColunm(field2, split(col("value", "\|".getItem(1))
.withColunm(field3, regex_extract("value", r"|{(.*?)}|",1))
.withColunm(field4, split(col("value", "\|".getItem(2))
.withColunm(field5, split(col("value", "\|".getItem(4))
new_table = df_out.select("field1", "field2", "field3", "field4", "field5")
new_table.display()
My result from this code is like some part of the json in field3 break incorrectly to field4 and field5.
Someone can help me to solve this problem?
Thank you!