I have a pyspark dataframe with only one column, holding data like the row below:
Gourav,Joshi,"Sibley Grove,402,wmbley,London,United Kingdom",E1234,44982728930
Is there any way to split this column and create a new dataframe like the one below?
I have tried to do this with the split function, but the address field gets split into multiple columns.
CAUTION: The sample code below only demonstrates how to apply the split()
function with a regular expression to achieve the desired result. The full use case
should still be reviewed to decide whether reading the source with spark.read.csv(),
which handles quoted fields natively, would be a more suitable approach.
— CODE —
from pyspark.sql.functions import col, split
data = [('A,B,"C,D",F,G',)]
raw_string_column_name = "Raw"
split_array_column_name = "SplitTokenArray"
computed_column_names = ["FirstName", "LastName", "Address", "PinCode", "Phone"]
# Split on commas that fall outside double-quoted fields
df = spark.createDataFrame(data, [raw_string_column_name])
df = df.withColumn(split_array_column_name, split(raw_string_column_name, r""",(?=(?:[^"]*"[^"]*")*[^"]*$)"""))
# Compute columns
for i, name in enumerate(computed_column_names):
    df = df.withColumn(name, col(split_array_column_name)[i])
# Drop the intermediate columns
df = df.drop(raw_string_column_name, split_array_column_name)
df.show()
— OUTPUT —
+---------+--------+-------+-------+-----+
|FirstName|LastName|Address|PinCode|Phone|
+---------+--------+-------+-------+-----+
| A| B| "C,D"| F| G|
+---------+--------+-------+-------+-----+
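The lookahead regex can be sanity-checked outside Spark with Python's re module, since pyspark.sql.functions.split accepts the same Java-style regex for this pattern. This sketch applies the identical pattern to the sample record from the question; it only splits on commas followed by an even number of double quotes, i.e. commas outside any quoted field:

```python
import re

# Split on a comma only if the remainder of the string contains an even
# number of double quotes -- meaning the comma is outside a quoted field.
PATTERN = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'

record = 'Gourav,Joshi,"Sibley Grove,402,wmbley,London,United Kingdom",E1234,44982728930'
tokens = re.split(PATTERN, record)
print(tokens)
# ['Gourav', 'Joshi', '"Sibley Grove,402,wmbley,London,United Kingdom"',
#  'E1234', '44982728930']
```

Note that the surrounding double quotes remain part of the address token; if they are unwanted, a follow-up regexp_replace (or Python .strip('"')) can remove them.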