I have a column in my dataframe data
that contains date and value information in a long string. The column, which we will call x
here, is formatted like this:
x = "{date1:val1, date2:val2, date3:val3, ...}"
I ultimately want to explode this data into two new columns, one for date and one for val. To use the explode
function, I understand that the column must be an array, not a string. So far, to handle this, I have removed the braces at the start and end of the string:
from pyspark.sql import functions as F
data = data.withColumn('x_1', F.regexp_replace('x', r'\{', ''))
data = data.withColumn('x_1', F.regexp_replace('x_1', r'\}', ''))
I then created a list variable:
data = data.withColumn('x_list', F.split('x_1', ', '))
I now have that x_list = [date1:val1, date2:val2, date3:val3, ...]
What I now need is to add quotes around each list element such that I ultimately get ['date1':'val1', 'date2':'val2', 'date3':'val3', ...]
I believe it may be possible to iterate through the list and use regex to add quotes, using the colon (:) as a split point, but I am struggling with how to do that. It would look something like:
for l in x_list:
#some regex expression
Alternatively, I have considered creating a sublist of each list element, but I am not sure how I would then use those sublists to create two new columns.
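For illustration, the per-element logic I am after could be sketched in plain Python first (the helper name parse_pairs is just illustrative), before wiring anything into Spark:

```python
def parse_pairs(s):
    """Turn "{date1:val1, date2:val2}" into [("date1", "val1"), ("date2", "val2")]."""
    if s is None:
        return None
    # Drop the surrounding braces, split entries on ", ",
    # then split each entry on the first ":" into a (date, val) pair.
    return [tuple(item.split(":", 1)) for item in s.strip("{}").split(", ")]
```

Written this way, no quoting is actually needed: once each entry is split on the colon, the (date, val) pairs already exist as separate strings.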
Your string is not valid JSON, and it is not a valid dict string either. You could go with this:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, StructType, StructField, ArrayType

# Return an array of (date, val) structs so both halves survive the explode
@F.udf(returnType=ArrayType(StructType([
    StructField("date", StringType()),
    StructField("val", StringType()),
])))
def parse_(s):
    if s is None:
        return None
    # Strip the braces, split entries on ", ", then split each entry
    # on the first ":" into a (date, val) pair
    return [tuple(item.split(":", 1)) for item in s.strip("{}").split(", ")]

df = spark.createDataFrame([[1, "{date1:val1, date2:val2, date3:val3}"]], schema=["col1", "col2"])
display(df.withColumn("pair", F.explode(parse_("col2"))).select("col1", "pair.date", "pair.val"))