I cannot get the extraction of this xml data to dataframe to work properly.
This is my xml sample. In reality I have multiple stacks of “Entity” which represent one line of data in the desired result. To save space I only pasted one instance of “Entity” here:
<EntityCollection>
<Entities>
<Entity>
<Attributes>
<KeyValuePairOfstringanyType>
<key>client_number</key>
<value type="b:string">ABC123345</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.client_type</key>
<value type="AliasedValue">
<AttributeLogicalName>my_type</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>123</ReturnType>
<Value type="OptionSetValue">
<Value>345</Value>
</Value>
</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.status</key>
<value type="AliasedValue">
<AttributeLogicalName>status</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>123</ReturnType>
<Value type="OptionSetValue">
<Value>67</Value>
</Value>
</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.date</key>
<value type="AliasedValue">
<AttributeLogicalName>date</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>89</ReturnType>
<Value type="b:string">2024-01-01</Value>
</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.country</key>
<value type="AliasedValue">
<AttributeLogicalName>country</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>123</ReturnType>
<Value type="OptionSetValue">
<Value>456</Value>
</Value>
</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>client_code</key>
<value type="b:guid">some_code456</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.my_data1id</key>
<value type="AliasedValue">
<AttributeLogicalName>my_data1id</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>13</ReturnType>
<Value type="b:guid">some_code123</Value>
</value>
</KeyValuePairOfstringanyType>
</Attributes>
<EntityState nil="true"/>
<FormattedValues>
<KeyValuePairOfstringstring>
<key>my_data.client_type</key>
<value>client123</value>
</KeyValuePairOfstringstring>
<KeyValuePairOfstringstring>
<key>my_data.status</key>
<value>OK</value>
</KeyValuePairOfstringstring>
<KeyValuePairOfstringstring>
<key>my_data.country</key>
<value>some_country</value>
</KeyValuePairOfstringstring>
</FormattedValues>
<Id>some_code456</Id>
<KeyAttributes/>
<LogicalName>some-id</LogicalName>
<RelatedEntities/>
<RowVersion>ID123</RowVersion>
</Entity>
</Entities>
</EntityCollection>
The elements in ‘key’ (or in ‘AttributeLogicalName’) should be the headers. The “value”/”Value” should be the actual data. This is the desired outcome (first row).
As the data is very nested, I tried using xml.etree but did not come very far as I not sure how to proceed form here:
for _, elem in ET.iterparse(my_xml):
if len(elem) == 0:
print(f'{elem.tag} {elem.attrib} text={elem.text}')
else:
print(f'{elem.tag} {elem.attrib}')
How can I get the desired result?
1
Consider XSLT, to transform the nested XML for the specific entity, key and value/Value nodes. Then, use pandas to pivot
by entity and due to repeated keys, suffixed numbers are added to key names. Pandas read_xml
supports XSLT 1.0 using the lxml
parser.
XSLT (save as .xsl file)
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/EntityCollection/Entities">
<root>
<xsl:apply-templates select="Entity"/>
</root>
</xsl:template>
<xsl:template match="Entity">
<xsl:apply-templates select="descendant::KeyValuePairOfstringanyType|descendant::KeyValuePairOfstringstring"/>
</xsl:template>
<xsl:template match="KeyValuePairOfstringanyType|KeyValuePairOfstringstring">
<data>
<entity><xsl:value-of select="ancestor::Entity/Id"/></entity>
<xsl:copy-of select="descendant::key"/>
<value><xsl:value-of select="descendant::Value[not(*)]|descendant::value[not(*)]"/></value>
</data>
</xsl:template>
</xsl:stylesheet>
XSLT Output (long format to scale to any key and value)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<data>
<entity>some_code456</entity>
<key>client_number</key>
<value>ABC123345</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.client_type</key>
<value>345</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.status</key>
<value>67</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.date</key>
<value>2024-01-01</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.country</key>
<value>456</value>
</data>
<data>
<entity>some_code456</entity>
<key>client_code</key>
<value>some_code456</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.my_data1id</key>
<value>some_code123</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.client_type</key>
<value>client123</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.status</key>
<value>OK</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.country</key>
<value>some_country</value>
</data>
</root>
Python (handles duplicate keys with suffixed count)
entities_df = (
pd.read_xml("input.xml", stylesheet="style.xsl")
.assign(key = lambda df: df["key"].str.cat(df.groupby("key").cumcount().add(1).astype('str')))
.pivot(index="entity", columns="key", values="value")
)
Python Output
entities_df
# key client_code1 client_number1 my_data.client_type1 my_data.client_type2 ... my_data.date1 my_data.my_data1id1 my_data.status1 my_data.status2
# entity ...
# some_code456 some_code456 ABC123345 345 client123 ... 2024-01-01 some_code123 67 OK
# [1 rows x 10 columns]