I am working with a vendor-provided dataset that has unstructured columns. One column is a struct type holding multiple key-value pairs.
example:
ID | Type | Data |
---|---|---|
1 | Day | {Day_1:'Monitor', Day_2:'Manage', Evening_6:'None', Evening_3:'Monitor'} |
2 | Evening | {Day_2:'N/A', Day_3:'None', Evening_1:'On Call', Evening_2:'On Call'} |
3 | Day | {Day_1:'Manage', Day_5:'On Call', Evening_2:'None'} |
I need to give it a structure for querying: based on the Type column, find the corresponding shifts in the Data struct column and pivot them into separate rows, splitting each key on the underscore.
So, the output should look like this:
ID | Type | Shift | Role |
---|---|---|---|
1 | Day | 1 | Monitor |
1 | Day | 2 | Manage |
2 | Evening | 1 | On Call |
2 | Evening | 2 | On Call |
3 | Day | 1 | Manage |
3 | Day | 5 | On Call |
I am working in PySpark and am trying to do this as follows:
- Form a regex from the Type column value + '_' + a number.
- Select the matching keys from the Data struct column using regexp_extract.
- Convert those columns to rows using the explode function.
- Split the key name on the underscore and keep the second part as the shift number.
I am new to PySpark and would like some help implementing this.