I have a table with XML data in it which is stored in a column with STRING datatype.
Can someone please help me on how to parse this XML into multiple sql columns.
Below is the sample XML Table & desired output data.
Select * from default.SampleData
XMLData:
'Status': {
'Code': 'Ok',
'Message': None,
'Errors': None
},
'RequestReference': None,
'Response': {
'IdNumber': '295',
'SearchKeys': {
'Subject': {
'FirstName': 'John',
'MiddleName': None,
'Surname': 'Ross',
'DateOfBirth': datetime.datetime(1900, 1, 31, 0, 0),
'Gender': 'M'
},
'Address': {
'AddressLine1': '123 old street',
'AddressLine2': 'high street',
'AddressLine3': '',
'AddressLine4': None,
'AddressLine5': None,
'Postcode': '12345',
'Country': None
},
'Phone': None,
'DrivingLicence': None,
'Bank': None,
'ConsentFlag': True
},
Expected output columns
IDNumber FirstName MiddleName LastName DateOfBirth AddressLine1 Phone
295 John Ross 1900-01-01 123 Highstreet Nonedatab
6
Your data seems to be json and not the xml but here is the code for both the ways.
For parsing xml data your xml string should be something like below.
'<a><b>b</b><c>cc</c></a>'
and parse it like below.
df.withColumn("parse_XMLData", xpath_string(col("XMLData"), lit("a/c"))).display()
Output:
| XMLData | parse_XMLData |
|--------------------------|---------------|
| <a><b>b</b><c>cc</c></a> | cc |
xpath_string
give first xml node which matches the expression.
Similarly you give like below.
df.withColumn("parse_XMLData", xpath_string(col("XMLData"), lit("IdNumber"))).display()
If you want all the node matching the expression you use xpath
function which returns array of strings matched the given expression.
Next, if it is the json data you use from_json
function like mentioned here.