
Canonicalization of an XML document


This page looks at canonicalization: how to canonicalize an XML document that is to be signed using XML-Signature Syntax and Processing [XML-DSIG]. The word CanonicalizatioN is shortened by those in the know to "C14N": the letters "C" and "N" with 14 other letters in between. This page expands on our earlier examples of C14N in our pages Signing an XML document using XMLDSIG (Part 1) and Part 2, and XML-DSIG and the Chile SII.

Canonicalization is a method for generating a physical representation, the canonical form, of an XML document that accounts for syntactic changes permitted by the XML specification [XMLSPEC].

In other words, no matter what changes are made to a given XML document in transmission (within what the XML specification permits), the canonical form will always be identical, byte for byte. This byte sequence is critical when signing an XML document or verifying its signature.

First, some general guidance, with apologies to Douglas Adams.

C14N is complicated. Really complicated. You just won't believe how vastly, hugely, mindbogglingly complicated it is.

Contents

Introduction | Extracting the subset to be signed | The procedure | Useful utilities | References | Contact

Introduction

This page is intended as a guide to help you do the c14n process "by hand", that is, working in a text editor or writing a simple program that deals with the data as a text file, or using a simple XML processor. We're assuming you have some degree of control over the structure of your XML document-to-be-signed, so you can avoid the messy bits. (It's actually very difficult to write something to deal with any arbitrary XML document.)

There are some really complicated arcane rules to deal with - for example, Processing Instructions and DTD files - but most XML documents people want to sign don't have these. Most people just want to sign a SOAP document or an invoice structured in a certain format specified by, say, a government tax authority, which don't need these XML complications. So don't use them!

We are only dealing with one specific Canonicalization Method here, sometimes called "Inclusive Canonicalization", specified in Canonical XML Version 1.0 [XML-C14N] with the reference

http://www.w3.org/TR/2001/REC-xml-c14n-20010315

There is also "Exclusive Canonicalization" (versions 1.0 and 1.1) which actually makes more sense for enveloped SOAP signatures because it doesn't invalidate the signature when you wrap something that is already signed. However, it is an order of magnitude harder to automate, so we won't cover it here. There are also flavours "With Comments". Well, punk, do you really need your comments in an XML invoice to be signed? No, you probably don't. So that's one more complication avoided.

The end result is that we take an XML document (a text file that may come in many forms, all equivalent from an XML point of view), and we convert it (or part of it) to another file that must be an exact sequence of bytes. We will be computing the message digest value of this sequence of bytes. If we get just one byte wrong, our digital signature will be wrong.

We explain this here in terms of creating a new file with the c14n'd data which can have its digest value computed at any time. In our experience this is the easiest way to handle a single example. Obviously you could write a program to do this using strings and byte arrays in your favourite programming language. (Hint: always store the UTF-8 encoded XML data in a byte array, not a string.)
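
To illustrate the hint about byte arrays, here is a minimal Python sketch that takes the c14n'd data as raw bytes and computes a base64-encoded digest of it. The function name digest_value and the default of SHA-1 are our assumptions for illustration; use whatever algorithm your DigestMethod element specifies.

```python
import base64
import hashlib

def digest_value(c14n_bytes: bytes, algorithm: str = "sha1") -> str:
    """Return the base64-encoded digest of the canonicalized bytes."""
    digest = hashlib.new(algorithm, c14n_bytes).digest()
    return base64.b64encode(digest).decode("ascii")

# One byte of difference (here CR-LF instead of LF) gives a different digest.
print(digest_value(b"<Doc>\n</Doc>"))
print(digest_value(b"<Doc>\r\n</Doc>"))
```

Working on bytes throughout means the digest never depends on how a string type happens to encode its characters.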

Is there a program that does this?

Well, yes. Please see our new program SC14N, a straightforward XML canonicalization utility, first released 11 July 2017. This should do all you need. SC14N comes with a command-line program and application interfaces for C/C++, C#, VB.NET and Python programming languages.

Of course, if you still need to do it by hand, please keep reading ...

Extracting the subset of the document to be signed

Cut and paste the relevant part of the document you want to canonicalize into a separate file. In our straightforward examples, there are typically three cases.
  1. You are canonicalizing the entire document (a reference with URI="") excluding the Signature element.
  2. You are canonicalizing a subset of the document with a given Id.
  3. You are canonicalizing the SignedInfo element.
As with all XML stuff, there are many other cases, most of which you should avoid. In particular, avoid references that use XPath. There be dragons. Just don't go there!

Case 1: Copy the entire root element from its opening "<" to its closing ">" and cut out the <Signature> element. Do this cutting carefully, leaving any white space before the first "<" of the Signature opening tag and any white space after the ">" of its closing tag.

<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
  <Body>
  ...
  </Body>
  <Signature>
    <SignedInfo>
      <Reference URI="">
      ...
    </SignedInfo>
    ...
  </Signature>
</Envelope>
<Envelope>
  <Body>
  ...
  </Body>
  
</Envelope>
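
If you would rather not do the cutting by hand, Case 1 can be roughly sketched in Python as below. This assumes a single, unprefixed, non-nested Signature element, as in the example above; the function name cut_signature is ours. A prefixed element such as ds:Signature would need a different pattern.

```python
import re

# Assumes an unprefixed, non-nested <Signature> element.
_SIGNATURE = re.compile(rb"<Signature\b.*?</Signature\s*>", re.DOTALL)

def cut_signature(data: bytes) -> bytes:
    """Remove the Signature element, leaving the whitespace around it untouched."""
    return _SIGNATURE.sub(b"", data, count=1)

doc = (b"<Envelope>\n  <Body>\n  ...\n  </Body>\n"
       b"  <Signature>\n    <SignedInfo>...</SignedInfo>\n  </Signature>\n"
       b"</Envelope>")
print(cut_signature(doc).decode("utf-8"))
```

Note that the two spaces before the Signature opening tag and the newline after its closing tag are still there in the result.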

Case 2: Copy the element with the matching reference from its opening "<" to its closing ">".

<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
  <Part>
    <Doc Id="P666">
    ...
    </Doc>
    <Signature>
      <SignedInfo>
        <Reference URI="#P666">
        ...
      </SignedInfo>
      ...
    </Signature>
  </Part>
</Envelope>
<Doc Id="P666">
...
</Doc>
A variant of this case is the Object element in an enveloping signature, see an example here. The end result is the same.

Case 3: Copy the relevant SignedInfo element from its opening "<" to its closing ">". Make sure you have included the correct DigestValue element.

<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
  <Part>
    <Doc Id="P666">
    ...
    </Doc>
    <Signature>
      <SignedInfo>
        <Reference URI="#P666">
        ...
        <DigestValue>...</DigestValue>
        </Reference>
      </SignedInfo>
      ...
    </Signature>
  </Part>
</Envelope>
<SignedInfo>
  <Reference URI="#P666">
  ...
  <DigestValue>...</DigestValue>
  </Reference>
</SignedInfo>

The procedure

The 2001 W3C document [XML-C14N] summarizes the transformation process in section 1.1. We'll deal with these steps in order of increasing complexity, and we'll leave the ones you should avoid to the end.

The XML declaration and document type declaration (DTD) are removed.
Whitespace outside of the document element is normalized.
Remove everything before the root element's opening tag, including the XML declaration, any in-line DTD (you're using a DTD?) and any byte-order mark (BOM). Remove all whitespace before the root element's opening tag <Root> and after its closing tag </Root>. The end result is that your c14n'd file starts with a < and ends with a >.

There are special whitespace "normalizing" rules if you have a processing instruction (PI) before the root element, but, honestly, you should avoid these things.
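
A minimal Python sketch of this trimming step follows. It assumes you know the root element's name, that the root has separate start and end tags, and that nothing in front of the root (a comment, say) contains text that looks like the root's start tag. The function name strip_outside_root is ours.

```python
import re

def strip_outside_root(data: bytes, root_name: bytes) -> bytes:
    """Keep only the bytes from the root start tag to the root end tag."""
    name = re.escape(root_name)
    start = re.search(rb"<" + name + rb"[\s>]", data)
    ends = list(re.finditer(rb"</" + name + rb"\s*>", data))
    if start is None or not ends:
        raise ValueError("root element not found")
    # Slice from the first "<" of the root to the ">" of its last end tag.
    return data[start.start():ends[-1].end()]

raw = b'\xef\xbb\xbf<?xml version="1.0"?>\n\n<Envelope>\n  <Body/>\n</Envelope>\n\n'
print(strip_outside_root(raw, b"Envelope"))
```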

Remove all comments
Remove all comments <!-- An XML comment -->. Make sure you leave any surrounding whitespace intact.

Suggestion: don't have any comments in the first place.
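
If you do have comments, a regex along the following lines is one way to remove them in Python. It is only a sketch: it assumes no text resembling a comment appears inside a CDATA section. The function name remove_comments is ours.

```python
import re

_COMMENT = re.compile(rb"<!--.*?-->", re.DOTALL)

def remove_comments(data: bytes) -> bytes:
    """Delete each comment but none of the whitespace around it."""
    return _COMMENT.sub(b"", data)

print(remove_comments(b"<Envelope>\n<!-- some comment -->\n  <Body>"))
```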

The document is encoded in UTF-8.
This is an issue if your file is encoded in, say, ISO-8859-1 (Latin-1).
  • If using Notepad++, select Encoding | Convert to UTF-8 (but do not choose any of the BOM options).
  • On Linux, use the iconv command iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
  • Using VBA/VB6, see How to convert VBA/VB6 Unicode strings to UTF-8.

Delete any byte-order mark (BOM) from the file. A byte-order mark for UTF-8 is the sequence of three bytes (0xEF,0xBB,0xBF) at the beginning of a file. This generally does not show up in a text editor, but it needs to be removed.

Note that a file consisting entirely of US-ASCII characters (all byte values < 128) is already UTF-8 encoded.
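
For those working in Python, the same two jobs can be sketched as follows. The function names are ours, and you need to know the source encoding yourself; nothing here tries to detect it.

```python
import codecs

def reencode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode from the stated source encoding and re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop the three-byte UTF-8 byte-order mark if the data starts with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

print(reencode_to_utf8(b"Ol\xe1 mundo", "iso-8859-1"))
print(strip_utf8_bom(b"\xef\xbb\xbf<Envelope>"))
```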

Line breaks normalized to #xA on input, before parsing
We're looking here at converting "Windows" CR-LF line endings to "Unix" LF line endings (0x0A, or #xA in xml-ese). If you've created your file on a Windows system, the chances are it has the wrong line endings.
  • With Notepad++, select Edit | EOL Conversion | Unix (LF).
  • On Linux systems use the utility dos2unix or fromdos.
  • If using a regex, do some variant of s/\r\n/\n/g.
  • With Perl on Windows, perl -pe "binmode(STDOUT);s/\R/\012/" input.xml > output.xml (thanks to Perl Tricks)
  • With C#, myString = myString.Replace("\r\n", "\n");.
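
A Python equivalent, working on bytes, might look like this. The function name is ours. It also maps a CR on its own to LF, which is how the XML specification's line-break handling treats it.

```python
def normalize_line_breaks(data: bytes) -> bytes:
    """Convert CR-LF pairs, then any remaining lone CR, to a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

print(normalize_line_breaks(b"<Envelope>\r\n  <Body>\r\n"))
```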
All whitespace in character content is retained (excluding characters removed during line feed normalization)
So don't change any whitespace between elements and whitespace inside an element. However, we will be changing the whitespace for attributes inside a start tag.

An example to demonstrate some of the above. We wish to canonicalize the root <Envelope> element.

<?xml version="1.0" encoding="ISO-8859-1"?>

<Envelope>
<!-- some comment -->
  <Body>
    Olá mundo
  </Body>

</Envelope>

<Envelope>

  <Body>
    Olá mundo
  </Body>

</Envelope>

We strip off the XML declaration and any white space before or after the root <Envelope> element. We remove the comment, leaving the newline after it intact. Then we make changes to line endings and data encoding which are not obvious in a text editor. Let's examine the data bytes before and after c14n using hexdump. (The dumps below are of the root element only, with the comment line left out to keep them short.)

Before: ISO-8859-1 encoding and Windows CR-LF line endings.
000000  3c 45 6e 76 65 6c 6f 70 65 3e 0d 0a 20 20 3c 42  <Envelope>..  <B
000010  6f 64 79 3e 0d 0a 20 20 20 20 4f 6c e1 20 6d 75  ody>..    Ol. mu
000020  6e 64 6f 0d 0a 20 20 3c 2f 42 6f 64 79 3e 0d 0a  ndo..  </Body>..
000030  0d 0a 3c 2f 45 6e 76 65 6c 6f 70 65 3e           ..</Envelope>
After: UTF-8 encoding and Unix (LF) line endings.
000000  3c 45 6e 76 65 6c 6f 70 65 3e 0a 20 20 3c 42 6f  <Envelope>.  <Bo
000010  64 79 3e 0a 20 20 20 20 4f 6c c3 a1 20 6d 75 6e  dy>.    Ol.. mun
000020  64 6f 0a 20 20 3c 2f 42 6f 64 79 3e 0a 0a 3c 2f  do.  </Body>..</
000030  45 6e 76 65 6c 6f 70 65 3e                       Envelope>
Note that
  • Each CR-LF line ending 0d 0a has been replaced by LF 0a
  • The letter á is changed from Latin-1 encoding e1 to UTF-8 c3 a1
  • All space characters (0x20) and newlines between the elements have been retained.
  • The first character is "<" (U+003C) and the last character is ">" (U+003E).
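
If you don't have a hexdump utility to hand, a few lines of Python will produce a similar listing. This is only a sketch that mimics the layout of the dumps above; the function name hexdump is ours.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values and printable ASCII, one row per line."""
    pad = width * 3 - 1  # width of a full row of hex values
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:06x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)

after = "<Envelope>\n  <Body>\n    Ol\u00e1 mundo\n  </Body>\n\n</Envelope>".encode("utf-8")
print(hexdump(after))
```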
Empty elements are converted to start-end tag pairs
Change all empty-element tags of the form <foo/> to the start-end tag pair form <foo></foo>. Do not put any white space between the ">" and the "<". For example:
<DigestMethod Algorithm="http:...#sha1" />
<DigestMethod Algorithm="http:...#sha1"></DigestMethod>
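
As a rough Python sketch, a single regex substitution can do this for straightforward documents. It assumes no attribute value contains a ">" character; a tag that does is simply left alone. The function name expand_empty_elements is ours.

```python
import re

# Group 1 is the element name, group 2 any attributes (trailing whitespace excluded).
_EMPTY_TAG = re.compile(rb"<([^\s<>/!?]+)([^<>]*?)\s*/>")

def expand_empty_elements(data: bytes) -> bytes:
    """Rewrite <foo/> as <foo></foo>, keeping the attributes as written."""
    return _EMPTY_TAG.sub(rb"<\1\2></\1>", data)

print(expand_empty_elements(b'<DigestMethod Algorithm="http:...#sha1" />'))
```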
Attribute values are normalized.
Attribute value delimiters are set to quotation marks (double quotes).
Whitespace within start and end tags is normalized
Rewrite all the attributes so any white space inside a tag between attributes is replaced by a single space and all attribute values are surrounded by double quotes ("). This means no line breaks and exactly one space between attributes.

We normalize the attribute values by replacing any whitespace character - SPACE (0x20), CR (0x0d), LF (0x0a), TAB (0x09) - with a single SPACE character (0x20).

Character references are treated differently, see below. There are some other subtle normalization rules for attributes that are not CDATA (this should not affect you in practice with "straightforward" XML documents).

<e1   a='one'
  b  = 'two'  >
<e1 a="one" b="two">
<e2 C=' letter
	A ' >
<e2 C=" letter  A ">

In the second example above, the original attribute value for "C" is (SPACE)letter(CR)(LF)(TAB)A(SPACE). The canonicalized value is (SPACE)letter(SPACE)(SPACE)A(SPACE). The CR-LF pair should have already been reduced to a single LF character which is then normalized to a single space.

All space between the final double quote in a start tag and the closing ">" is removed, as are all spaces in a closing tag.

<e3  d= "foo"  >bar</e3   >
<e3 d="foo">bar</e3>

We also need to sort the attributes in "lexicographic" order and deal with special characters in the attribute values. We'll discuss this below.
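
The tag rewriting described in this section can be sketched in Python as follows. The sketch works on decoded text rather than bytes (encode back to UTF-8 before computing any digest), expects one tag at a time, assumes line endings have already been converted to LF, and does not sort the attributes or touch character references. The function names are ours.

```python
import re

# name = "value" or name = 'value', with optional whitespace around the "="
_ATTRIBUTE = re.compile(r"""([^\s=<>/]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

def normalize_start_tag(tag: str) -> str:
    """Rewrite one start tag with single spaces and double-quoted values."""
    name_match = re.match(r"<([^\s<>/]+)", tag)
    parts = [name_match.group(1)]
    for match in _ATTRIBUTE.finditer(tag, name_match.end()):
        name, double_quoted, single_quoted = match.groups()
        value = double_quoted if double_quoted is not None else single_quoted
        value = re.sub(r"[\t\n\r]", " ", value)  # each whitespace character becomes one space
        value = value.replace('"', "&quot;")     # keep the new delimiters unambiguous
        parts.append(f'{name}="{value}"')
    return "<" + " ".join(parts) + ">"

def normalize_end_tag(tag: str) -> str:
    """Remove any whitespace inside an end tag."""
    return re.sub(r"\s+", "", tag)

print(normalize_start_tag("<e1   a='one'\n  b  = 'two'  >"))
print(normalize_start_tag('<e3  d= "foo"  >') + "bar" + normalize_end_tag("</e3   >"))
```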

Namespaces are propagated down from any parent element.
In many cases, you are canonicalizing a subset (fragment) of a complete XML document. In that case you must propagate down any namespaces in the parent element to the root element of the subset. With the inclusive canonicalization we are doing, all namespaces are propagated down, even if the subset does not use them.
<?xml version="1.0" encoding="UTF-8"?>
<Envelope xmlns="http://www.example.com">
  <Part xmlns:ab="http://www.ab.com">
    <Doc Id="P666">
    ...
    </Doc>
    <Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
      <SignedInfo>
        <Reference URI="#P666">
        ...
      </SignedInfo>
      ...
    </Signature>
  </Part>
</Envelope>
<Doc xmlns="http://www.example.com" xmlns:ab="http://www.ab.com" Id="P666">
...
</Doc>
<SignedInfo xmlns="http://www.w3.org/2000/09/xmldsig#" xmlns:ab="http://www.ab.com">
        <Reference URI="#P666">
        ...
      </SignedInfo>
Note that the default "xmldsig" namespace from the Signature element overrides the "example.com" one for the c14n'd SignedInfo element.
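
The propagation rule amounts to merging the namespace declarations of every ancestor, outermost first, so that an inner declaration overrides an outer one. Here is a minimal Python sketch of that idea. Representing each element's declarations as a dict keyed by prefix, with "" standing for the default namespace, is our own convention for illustration, as are the function names.

```python
def in_scope_namespaces(ancestor_declarations):
    """Merge xmlns declarations from the outermost ancestor inwards."""
    scope = {}
    for declarations in ancestor_declarations:
        scope.update(declarations)  # an inner declaration overrides an outer one
    return scope

def render_declarations(scope):
    """Write the declarations out, default namespace first, then by prefix."""
    parts = []
    for prefix in sorted(scope):  # "" (the default namespace) sorts first
        name = "xmlns" if prefix == "" else "xmlns:" + prefix
        parts.append(f'{name}="{scope[prefix]}"')
    return " ".join(parts)

ancestors = [
    {"": "http://www.example.com"},              # Envelope
    {"ab": "http://www.ab.com"},                 # Part
    {"": "http://www.w3.org/2000/09/xmldsig#"},  # Signature
]
print("<SignedInfo " + render_declarations(in_scope_namespaces(ancestors)) + ">")
```

For the Doc element, only the first two entries apply, because Signature is a sibling of Doc and not one of its ancestors.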
Superfluous namespace declarations are removed from each element.
A superfluous namespace declaration is one that has already been declared with the same value in a parent of the element, where that parent is in scope for the c14n'd part.

In the following example, we are canonicalizing the entire Envelope element including the Signature. The "ab" namespace in the Doc element has already been declared with the same attribute value in the parent Part, so it is removed. Similarly, the "xmldsig" namespace in the SignedInfo element has already been declared in its parent, so it is removed.

<?xml version="1.0" encoding="UTF-8"?>
<Envelope xmlns="http://www.example.com">
  <Part xmlns:ab="http://www.ab.com">
    <Doc Id="P666" xmlns:ab="http://www.ab.com">
    ...
    </Doc>
    <Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
      <SignedInfo xmlns="http://www.w3.org/2000/09/xmldsig#">
        <Reference URI="#P666">
        ...
      </SignedInfo>
      ...
    </Signature>
  </Part>
</Envelope>
<Envelope xmlns="http://www.example.com">
  <Part xmlns:ab="http://www.ab.com">
    <Doc Id="P666">
    ...
    </Doc>
    <Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
      <SignedInfo>
        <Reference URI="#P666">
        ...
      </SignedInfo>
      ...
    </Signature>
  </Part>
</Envelope>
There are some extra messy rules for the default namespace xmlns="". Just don't use it!
Lexicographic order is imposed on the namespace declarations and attributes of each element

The c14n ordering of attributes is as follows.

  1. The default namespace declaration xmlns="...", if any, comes first.
  2. Namespace declarations, sorted by prefix (the part after "xmlns:"). So xmlns:a="http://www.w3.org" comes before xmlns:b="http://www.ietf.org".
  3. Unqualified attributes, sorted by name. So attr="..." comes before attr2="...".
  4. Qualified attributes, sorted by namespace URI then name. So b:attr="..." comes before a:attr="...", because we read this as http://www.ietf.org:attr="..." comes before http://www.w3.org:attr="...". And a:attr="..." comes before a:attr2="..."
<e xmlns="http://example.org" xmlns:a="http://www.w3.org" xmlns:b="http://www.ietf.org" attr="I'm" attr2="all" b:attr="sorted" a:attr="out" a:attr2="now"></e>
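
The four ordering rules translate naturally into a sort key. The Python sketch below is one way to express them; the function name and the prefix_to_uri mapping (from each prefix to its namespace URI) are our own names for illustration.

```python
def c14n_attribute_key(name, prefix_to_uri):
    """Sort key for one attribute name under the four ordering rules."""
    if name == "xmlns":
        return (0, "", "")                         # 1. default namespace declaration
    if name.startswith("xmlns:"):
        return (1, name[len("xmlns:"):], "")       # 2. declarations, by prefix
    if ":" in name:
        prefix, local = name.split(":", 1)
        return (3, prefix_to_uri[prefix], local)   # 4. qualified, by URI then name
    return (2, "", name)                           # 3. unqualified, by name

uris = {"a": "http://www.w3.org", "b": "http://www.ietf.org"}
attrs = ["a:attr2", "b:attr", "attr2", "xmlns:b", "a:attr", "attr", "xmlns", "xmlns:a"]
print(sorted(attrs, key=lambda name: c14n_attribute_key(name, uris)))
```

Python compares strings by code point, so no locale settings come into play.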

For an excellent explanation of the rules to sort attributes when canonicalizing your data for XML-DSIG, see Keith S. Beattie's article on attribute ordering KSB's XML C14N Notes.

Character references are replaced.
Special characters in attribute values and character content are replaced by character references
A reminder. An XML character reference begins with "&#" and ends with a ";". These are meant to be used to input characters that aren't on your keyboard, or enter characters that already have a meaning in XML, like "<".

For example, we could write &#xE1; to represent the letter á, or we could equivalently write &#225; or &#x00e1;.

The general rule: With a few exceptions (see below), all character references are changed to the actual UTF-8-encoded representation of the character.

So, for example, the character reference &#xE1; is replaced by the two bytes 0xc3 0xa1 (which should show as á in a UTF-8 compliant text editor). The character reference &#64; representing the "COMMERCIAL AT" symbol "@" is replaced by the byte 0x40, its UTF-8 encoding. The character reference &#20013; for the Chinese character 中 (U+4E2D) is replaced by its UTF-8 encoding, the three bytes 0xE4 0xB8 0xAD.
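
The general rule can be sketched in Python as below. The sketch works on decoded text and deliberately ignores the exceptions, so a reference to a character such as "&" or "<" would come out wrong. The function name expand_char_refs is ours.

```python
import re

# Decimal (&#225;) or hexadecimal (&#xE1;) character references
_CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")

def expand_char_refs(text: str) -> str:
    """Replace each character reference by the character it names."""
    def replace(match):
        body = match.group(1)
        code = int(body[1:], 16) if body.startswith("x") else int(body)
        return chr(code)
    return _CHAR_REF.sub(replace, text)

for ref in ("&#xE1;", "&#225;", "&#x00e1;", "&#64;", "&#20013;"):
    print(ref, expand_char_refs(ref).encode("utf-8").hex())
```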

Before we go on, let's just remind ourselves of some XML terminology:
<tag>content</tag>
<tag attribute-name="attribute-value">content</tag>
The "content" is the text between the opening tag and the closing tag of an element. An "attribute-value" is the text between the "delimiters" (quotes) in an attribute.

The exceptions: The exceptions are the five XML predefined entities (amp, lt, gt, apos, quot) and certain white space characters. The treatment is different depending whether they are in element content or in an attribute value. Here's a summary of the main rules.

  • In all cases, the character & is written as &amp; and < is written as &lt;
  • The single quote/apostrophe ' is always left as is, and the entity &apos; is always changed to ', encoded as the byte 0x27.
  • The double quote " is left as is in element content (the byte 0x22), but is changed to &quot; in an attribute value.
  • The greater-than symbol > is changed to &gt; in element content, but left as is in an attribute value (byte 0x3E).
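  • A CR character written as a character reference (&#xD; or equivalent) in element content or an attribute value is always output as the character reference &#xD; with the hexadecimal value "D" in uppercase and no leading zeros. An XML parser converts a literal lone CR byte (0x0d) to LF during line-break normalization, so only a referenced CR reaches the output.
  • The whitespace characters TAB and LF written as character references in an attribute value are output as the character references "&#x9;" and "&#xA;", respectively (literal ones have already been normalized to spaces). In element content, TAB (0x09) and LF (0x0A) are left as is.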
  • The correct c14n form of those few character references left in is "uppercase hexadecimal with no leading zeros". So &#xA; is correct, but &#xa;, &#10; and &#x00A; are not.

Make sure you have already changed your attribute value delimiters to double quotes before doing the above.

Hint: If you can, avoid using these messy whitespace characters other than a space in attribute values. In fact, for attribute values, try and avoid all the cases in the exceptions above.
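
One way to keep these rules straight is to resolve every reference to the actual character first, and then apply a fixed set of replacements on output, one set for element content and one for attribute values. The Python sketch below does the output half. It takes values in which every reference has already been resolved to the character it stands for; the function names are ours.

```python
def escape_text(value: str) -> str:
    """Output escaping for element content."""
    return (value.replace("&", "&amp;")   # must come first
                 .replace("<", "&lt;")
                 .replace(">", "&gt;")
                 .replace("\r", "&#xD;"))

def escape_attribute(value: str) -> str:
    """Output escaping for an attribute value delimited by double quotes."""
    return (value.replace("&", "&amp;")   # must come first
                 .replace("<", "&lt;")
                 .replace('"', "&quot;")
                 .replace("\t", "&#x9;")
                 .replace("\n", "&#xA;")
                 .replace("\r", "&#xD;"))

print(escape_text('Fish & "Chips" <now> it\'s'))
print(escape_attribute('a>b "c"\td'))
```

The ampersand is replaced first so that the ampersands introduced by the later replacements are not escaped a second time.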
CDATA sections are replaced with their character content
CDATA sections are used to escape blocks of text containing characters which would otherwise be recognized as markup, such as left angle brackets "<" and ampersands "&". This makes typing, say, verbatim XML like the example below much less messy. To canonicalize it we put it back in its messy form.
<doc>
<![CDATA[
<contact>
<name>Fred Bloggs</name>
<coy>Branston & Pickle</coy>
</contact>
]]>
</doc>
<doc>

&lt;contact&gt;
&lt;name&gt;Fred Bloggs&lt;/name&gt;
&lt;coy&gt;Branston &amp; Pickle&lt;/coy&gt;
&lt;/contact&gt;

</doc>
Note that, in this example, there are newlines before the <![CDATA[ and after the ]]>. These are retained in the c14n transform.
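
A rough Python sketch of this replacement follows. It works on decoded text and assumes line endings have already been normalized; the function name replace_cdata is ours.

```python
import re

_CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)

def replace_cdata(text: str) -> str:
    """Replace each CDATA section by its content, escaped as element content."""
    def escaped(match):
        content = match.group(1)
        return (content.replace("&", "&amp;")
                       .replace("<", "&lt;")
                       .replace(">", "&gt;"))
    return _CDATA.sub(escaped, text)

source = ("<doc>\n<![CDATA[\n<contact>\n<name>Fred Bloggs</name>\n"
          "<coy>Branston & Pickle</coy>\n</contact>\n]]>\n</doc>")
print(replace_cdata(source))
```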
Parsed entity references are replaced
Example:
<?xml version="1.0"?>
<!DOCTYPE doc [
<!ENTITY ourname "DI Management">
]>

<doc>&ourname;</doc>
<doc>DI Management</doc>
Default attributes are added to each element
You should not come across a default attribute in practice. They need to be specified in a DTD and you are not using a DTD, are you? If you are, and the DTD declares a default attribute that an element in the original does not spell out, you need to add it.
<!DOCTYPE doc [<!ATTLIST e1 attr CDATA "default">]>   
<doc>
   <e1   />
</doc> 
<doc>
   <e1 attr="default"></e1>
</doc>

Useful utilities

References

Contact

For more information, or to comment on this page, please send us a message.

This page first published 28 June 2017. Last updated 16 July 2017.