I need to get the same hash of an XML document in any language.
I tried to get the XML’s canonical form and then compute its hash.
But what I found was that the canonical form is not a “fixed standard”. It is implemented differently by every library and language I worked with, so I never get the SAME hash.
So, my question is: is there a way to get a reliable hash of the same canonical XML?
Edit
I’m using this XML as an example:
<doc xmlns="http://www.ietf.org" xmlns:w3c="http://www.w3.org" xml:base="something/else">
<e1>
<e2 xmlns="" xml:id="abc" xml:base="bar/">
<e3 id="E3" xml:base="foo"/>
</e2>
</e1>
</doc>
To get the Canonical Form, I’ve used:
In C# (.NET 8):
- Lib System.Security.Cryptography.Xml (version 8.0.0)
string stringXml = "<doc xmlns=\"http://www.ietf.org\" xmlns:w3c=\"http://www.w3.org\" xml:base=\"something/else\">\n <e1>\n <e2 xmlns=\"\" xml:id=\"abc\" xml:base=\"bar/\">\n <e3 id=\"E3\" xml:base=\"foo\"/>\n </e2>\n </e1>\n</doc>";
System.Security.Cryptography.Xml.XmlDsigC14NWithCommentsTransform c14n = new();
System.Xml.XmlDocument documentXml = new();
documentXml.LoadXml(stringXml);
c14n.LoadInput(documentXml);
Stream stream = (Stream)c14n.GetOutput(typeof(Stream));
string result = new StreamReader(stream).ReadToEnd();
using var hash = System.Security.Cryptography.SHA256.Create();
var byteArray = hash.ComputeHash(System.Text.Encoding.UTF8.GetBytes(result));
string sha256hex = Convert.ToHexString(byteArray);
Console.WriteLine(sha256hex);
- The sha256hex result was:
4716238DE66819B69981AE1BD3943451D0EADEEA001583D27CDFDC4255484CB6
- The canonicalized result was:
<doc xmlns="http://www.ietf.org" xmlns:w3c="http://www.w3.org" xml:base="something/else"><e1><e2 xmlns="" xml:base="bar/" xml:id="abc"><e3 id="E3" xml:base="foo"></e3></e2></e1></doc>
In Java (version 21):
- Lib org.apache.santuario xmlsec (version 4.0.2)
- And Lib commons-codec (version 1.17.0) (just for hashing)
String stringXml = "<doc xmlns=\"http://www.ietf.org\" xmlns:w3c=\"http://www.w3.org\" xml:base=\"something/else\">\n <e1>\n <e2 xmlns=\"\" xml:id=\"abc\" xml:base=\"bar/\">\n <e3 id=\"E3\" xml:base=\"foo\"/>\n </e2>\n </e1>\n</doc>";
org.apache.xml.security.Init.init();
org.apache.xml.security.c14n.Canonicalizer c14n = org.apache.xml.security.c14n.Canonicalizer.getInstance(org.apache.xml.security.c14n.Canonicalizer.ALGO_ID_C14N_WITH_COMMENTS);
java.io.ByteArrayOutputStream stream = new java.io.ByteArrayOutputStream();
c14n.canonicalize(stringXml.getBytes(), stream, false);
String result = stream.toString(java.nio.charset.StandardCharsets.UTF_8);
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(result);
System.out.println(sha256hex);
- The sha256hex result was:
dea874fbbe21f9e27e521cfddf61aa54bc1b0b18692e3105455eeca24beea1f6
- The canonicalized result was:
<doc xmlns="http://www.ietf.org" xmlns:w3c="http://www.w3.org" xml:base="something/else">
<e1>
<e2 xmlns="" xml:base="bar/" xml:id="abc">
<e3 id="E3" xml:base="foo"></e3>
</e2>
</e1>
</doc>
Your problem is that in the C# code, prior to canonicalization, you parsed the XML document with a parser that strips whitespace, whereas in Java, you preserved the whitespace. In general whitespace in XML is significant, though of course in some particular document types it isn’t. You need to decide whether to strip whitespace prior to canonicalization or not; you will get different signatures (and different canonicalizations) depending on whether you strip it or not.
With C#, to avoid stripping whitespace, set the PreserveWhitespace property on the XmlDocument to true before calling LoadXml.
If instead you want to add whitespace stripping to the Java code, the simplest way is to run a small XSLT 3.0 transformation (this needs an XSLT 3.0 processor such as Saxon; the JDK's built-in processor only supports XSLT 1.0):
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:strip-space elements="*"/>
<xsl:mode on-no-match="shallow-copy"/>
</xsl:transform>
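If no XSLT 3.0 processor is available, roughly the same effect can be sketched with the XSLT 1.0 processor that ships with the JDK, replacing the shallow-copy mode with the classic identity template. This is only an illustration: the class name WhitespaceStripper and the method strip are hypothetical, and the result would still have to be passed on to the canonicalizer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper (not part of xmlsec): removes whitespace-only text
// nodes so the Java input resembles what XmlDocument loads by default in C#.
final class WhitespaceStripper {

    // XSLT 1.0 stand-in for the 3.0 stylesheet: strip-space plus an
    // identity template, because 1.0 has no xsl:mode on-no-match.
    private static final String STYLESHEET =
        "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
      + "<xsl:strip-space elements=\"*\"/>"
      + "<xsl:template match=\"@*|node()\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String strip(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        // Keep the serializer from adding a declaration or its own indentation.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<doc>\n <e1>\n  <e2 id=\"x\"> keep me </e2>\n </e1>\n</doc>";
        // Whitespace-only text between elements is dropped; " keep me " stays.
        System.out.println(strip(xml));
    }
}
```

Only text nodes that consist entirely of whitespace are removed, so mixed content such as " keep me " survives unchanged. Whether stripping is safe at all depends on the document type, as noted above.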
In both your C# and Java code, you are simply hashing the byte sequence contained in the string variable result. This will only yield the same hash value if both strings are byte-for-byte identical. Even a 1-bit difference between the two strings will produce completely different hash values. So your canonicalization procedure has to produce identical strings to give the same hash.
To diagnose the issue, save the string result to a file and compare with some sort of file comparison utility (Beyond Compare, WinDiff or even the CMD prompt’s fc).
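The same comparison can also be done in code. The following is a minimal sketch using only the JDK; the class name CanonicalDiff and its methods are hypothetical, and the two sample inputs are shortened stand-ins for the canonical outputs of the two platforms. It hashes the raw bytes and reports the index of the first byte at which the two outputs diverge.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical diagnostic helper: hash canonical bytes and locate the
// first position where two canonical outputs differ.
final class CanonicalDiff {

    // SHA-256 over the raw bytes, returned as lowercase hex.
    static String sha256Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Index of the first differing byte, or -1 if the arrays are identical.
    static int firstDifference(byte[] a, byte[] b) {
        return Arrays.mismatch(a, b);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Stand-ins for the whitespace-stripped and whitespace-preserving outputs.
        byte[] stripped = "<doc><e1></e1></doc>".getBytes(StandardCharsets.UTF_8);
        byte[] preserved = "<doc>\n<e1></e1>\n</doc>".getBytes(StandardCharsets.UTF_8);

        System.out.println(sha256Hex(stripped));
        System.out.println(sha256Hex(preserved));
        System.out.println("first differing byte at index " + firstDifference(stripped, preserved));
    }
}
```

Feeding the canonicalizer's output bytes straight into the digest, rather than decoding them to a string and re-encoding, also removes character encoding as a possible source of differences.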