I have an xml file as below:
xml_s = """
<?xml version="1.0" encoding="utf-8"?>
<feed xml:base="https://url_path/1f3a6012-d6b2-4258-a91a-fc6d4e86c304/"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
xmlns="http://www.w3.org/2005/Atom">
<title type="text">Data</title>
<id>https://<url_path>/1f3a6012-d6b2-4258-a91a-fc6d4e86c304/Data</id>
<updated>2024-07-18T05:22:18Z</updated>
<link rel="self" title="Data" href="Data" />
<m:count>5746</m:count>
<entry>
<id>uuid:58aaa654-1649-4061-97fe-ecc53e186fee;id=1</id>
<title type="text"></title>
<updated>2024-07-18T05:22:18Z</updated>
<category term="Intelex.Data"
scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<m:properties>
<d:RowNum m:type="Edm.Int32">1</d:RowNum>
<d:Record_x0020_No m:type="Edm.Int64">491</d:Record_x0020_No>
<d:Location m:type="Edm.String">Amine > Shared</d:Location>
<d:MOC_x0020_Initiator m:type="Edm.String" m:null="true" />
<d:Status m:type="Edm.String">Vertical Heads Approval</d:Status>
<d:Workflow_x0020_Status m:type="Edm.String">1</d:Workflow_x0020_Status>
<d:Date_x0020_Created m:type="Edm.DateTime">2022-11-15T15:01:17</d:Date_x0020_Created>
</m:properties>
</content>
</entry>
<entry>
<id>uuid:58aaa654-1649-4061-97fe-ecc53e186fee;id=2</id>
<title type="text"></title>
<updated>2024-07-18T05:22:18Z</updated>
<category term="Intelex.Data" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<m:properties>
<d:RowNum m:type="Edm.Int32">2</d:RowNum>
<d:Record_x0020_No m:type="Edm.Int64">586</d:Record_x0020_No>
<d:Location m:type="Edm.String">Ruby > DNB-NT</d:Location>
<d:MOC_x0020_Initiator m:type="Edm.String">XYZ</d:MOC_x0020_Initiator>
<d:Status m:type="Edm.String">Vertical Heads Approval</d:Status>
<d:Workflow_x0020_Status m:type="Edm.String">1</d:Workflow_x0020_Status>
<d:Date_x0020_Created m:type="Edm.DateTime">2022-12-07T14:14:19</d:Date_x0020_Created>
</m:properties>
</content>
</entry>
</feed>
"""
I am trying to convert this in pandas dataframe using pd.read_xml().
import pandas as pd
import xml.etree.ElementTree as ET
x = ET.parse(xml_s)
root = x.getroot()
s = ET.tostring(root).decode()
namespace =
{'ns_0': 'http://www.w3.org/2005/Atom',
'ns_1': 'http://schemas.microsoft.com/ado/2007/08/dataservices',
'ns_2': 'http://schemas.microsoft.com/ado/2007/08/dataservices/metadata'}
df = pd.read_xml(s,namespaces = namespace, xpath='.//properties')
But this is giving me an error.
ValueError: xpath does not return any nodes. Be sure row level nodes are in xpath. If
document uses namespaces denoted with xmlns, be sure to define namespaces and use them
in xpath.
Then I noticed that properties
tag is mapped with m
namespace. Hence I supplied xpath = './/ns_2:properties'
and this has worked!!
So with that my question would be:
- How can I dynamically identify the “last child node” of any xml. My assumption is that within the “last child node” the actual data would be present.
- Since in this case the last child node was associated with
m
namespace, how this mapping can be done dynamically?