Canonicalization of an XML document
This page looks at canonicalization: how to canonicalize an XML document that is to be signed using XML-Signature Syntax and Processing [XML-DSIG]. The word CanonicalizatioN is shortened by those in the know to "C14N", the letters "C" and "N" with 14 other letters in between. This page expands on our earlier examples of C14N in our pages Signing an XML document using XMLDSIG (Part 1) and Part 2, and XML-DSIG and the Chile SII.
Canonicalization is a method for generating a physical representation, the canonical form, of an XML document that accounts for syntactic changes permitted by the XML specification [XMLSPEC].
In other words, no matter what changes could be made to a given XML document under transmission, the canonical form will always be identical, byte-for-byte. This byte sequence is critical when signing an XML document or verifying its signature.
First, some general guidance, with apologies to Douglas Adams.
C14N is complicated. Really complicated. You just won't believe how vastly, hugely, mindbogglingly complicated it is.
This page is intended as a guide to help you do the c14n process "by hand", that is, working in a text editor or writing a simple program that deals with the data as a text file, or using a simple XML processor. We're assuming you have some degree of control over the structure of your XML document-to-be-signed, so you can avoid the messy bits. (It's actually very difficult to write something to deal with any arbitrary XML document.)
There are some really complicated arcane rules to deal with - for example, Processing Instructions and DTD files - but most XML documents people want to sign don't have these. Most people just want to sign a SOAP document or an invoice structured in a certain format specified by, say, a government tax authority, which don't need these XML complications. So don't use them!
We are only dealing with one specific Canonicalization Method here, sometimes called "Inclusive Canonicalization", specified in Canonical XML Version 1.0 [XML-C14N] with the reference
There is also "Exclusive Canonicalization" (versions 1.0 and 1.1) which actually makes more sense for enveloped SOAP signatures because it doesn't invalidate the signature when you wrap something that is already signed. However, it is an order of magnitude harder to automate, so we won't cover it here. There are also flavours "With Comments". Well, punk, do you really need your comments in an XML invoice to be signed? No, you probably don't. So that's one more complication avoided.
The end result is that we take an XML document (a text file that may come in many forms, all equivalent from an XML point of view), and we convert it (or part of it) to another file that must be an exact sequence of bytes. We will be computing the message digest value of this sequence of bytes. If we get just one byte wrong, our digital signature will be wrong.
We explain this here in terms of creating a new file with the c14n'd data which can have its digest value computed at any time. In our experience this is the easiest way to handle a single example. Obviously you could write a program to do this using strings and byte arrays in your favourite programming language. (Hint: always store the UTF-8 encoded XML data in a byte array, not a string.)
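To make the hint about byte arrays concrete, here is a minimal Python sketch of computing a digest value over an exact byte sequence. The helper name digest_b64 and the sample data are our own assumptions for illustration; SHA-1 is used only because it appears in the examples on this page.

```python
import base64
import hashlib

def digest_b64(data: bytes, algorithm: str = "sha1") -> str:
    """Return the base64-encoded message digest of an exact byte sequence."""
    return base64.b64encode(hashlib.new(algorithm, data).digest()).decode("ascii")

# Keep the canonical form as UTF-8 bytes, not as a decoded string.
canonical = "<Body>Olá mundo</Body>".encode("utf-8")
# The same text with one stray CR-LF added is a different byte sequence.
altered = canonical.replace(b"</Body>", b"\r\n</Body>")

print(digest_b64(canonical))
print(digest_b64(altered))  # a different value: every byte matters
```

The two printed values differ, which is the whole point: the digest is computed over bytes, so any change in encoding or line endings changes it.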
Is there a program that does this?
Well, yes. Please see our new program SC14N, a straightforward XML canonicalization utility, first released 11 July 2017. This should do all you need. SC14N comes with a command-line program and application interfaces for C/C++, C#, VB.NET and Python programming languages.
Of course, if you still need to do it by hand, please keep reading ...
- You are canonicalizing the entire document (a reference with URI="") excluding the Signature element.
- You are canonicalizing a subset of the document with a given Id.
- You are canonicalizing the SignedInfo element.
Case 1: Copy the entire root element from its opening "<" to its closing ">" and cut out the Signature element. Do this cutting carefully, leaving any white space before the first "<" of the Signature opening tag and any white space after the ">" of its closing tag.
<?xml version="1.0" encoding="UTF-8"?> <Envelope> <Body> ... </Body> <Signature> <SignedInfo> <Reference URI=""> ... </SignedInfo> ... </Signature> </Envelope>
<Envelope> <Body> ... </Body> </Envelope>
Case 2: Copy the element with the matching reference from its opening "<" to its closing ">".
<?xml version="1.0" encoding="UTF-8"?> <Envelope> <Part> <Doc Id="P666"> ... </Doc> <Signature> <SignedInfo> <Reference URI="P666"> ... </SignedInfo> ... </Signature> </Part> </Envelope>
<Doc Id="P666"> ... </Doc>
Case 3: Copy the relevant SignedInfo element from its opening "<" to its closing ">". Make sure you have included the correct DigestValue.
<?xml version="1.0" encoding="UTF-8"?> <Envelope> <Part> <Doc Id="P666"> ... </Doc> <Signature> <SignedInfo> <Reference URI="P666"> ... <DigestValue>...</DigestValue> </Reference> </SignedInfo> ... </Signature> </Part> </Envelope>
<SignedInfo> <Reference URI="P666"> ... <DigestValue>...</DigestValue> </Reference> </SignedInfo>
The 2001 W3C document [XML-C14N] summarizes the transformation process in section 1.1. We'll deal with these steps in order of increasing complexity, and we'll leave the ones you should avoid to the end.
The XML declaration and document type declaration (DTD) are removed.
Whitespace outside of the document element is normalized.
Remove everything before the root element's opening tag, including the XML declaration, any in-line DTD (you're using a DTD?) and any byte-order mark (BOM).
Remove all whitespace before the root element's opening tag <Root> and after its closing tag </Root>. The end result is that your c14n'd file starts with a < and ends with a >.
There are special whitespace "normalizing" rules if you have a processing instruction (PI) before the root element, but, honestly, you should avoid these things.
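The trimming described above can be sketched in a few lines of Python working directly on bytes. This is only a sketch under the assumptions we recommend on this page: no DTD, no comments and no processing instructions outside the root element. The function name strip_outside_root is hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"
XML_WHITESPACE = b" \t\r\n"

def strip_outside_root(data: bytes) -> bytes:
    """Drop the BOM, the XML declaration and whitespace outside the root element.

    Assumes there is no DTD, comment or processing instruction outside the root.
    """
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    data = data.lstrip(XML_WHITESPACE)
    if data.startswith(b"<?xml"):
        # Cut everything up to and including the end of the declaration.
        data = data[data.index(b"?>") + 2:]
    return data.strip(XML_WHITESPACE)

sample = b'\xef\xbb\xbf<?xml version="1.0"?>\r\n<Root>\r\n</Root>\r\n'
result = strip_outside_root(sample)
print(result[:1], result[-1:])  # first and last bytes of the result
```

Whitespace inside the root element is left alone; only what lies before the first "<" and after the last ">" is removed.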
- Remove all comments
Remove all comments <!-- An XML comment -->. Make sure you leave any surrounding whitespace intact.
Suggestion: don't have any comments in the first place.
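If you do have comments, a regular expression is one simple way to cut them out of the raw bytes. This is a sketch, not a full XML parser: it assumes the text "<!--" never appears inside a CDATA section. The name remove_comments is our own.

```python
import re

# Non-greedy match so each comment is removed separately; DOTALL lets
# a comment span several lines.
COMMENT_RE = re.compile(rb"<!--.*?-->", re.DOTALL)

def remove_comments(data: bytes) -> bytes:
    """Delete XML comments, leaving all surrounding whitespace untouched."""
    return COMMENT_RE.sub(b"", data)

print(remove_comments(b"<a>\n<!-- An XML comment -->\n</a>"))
```

Note that the newlines on either side of the comment survive, as required.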
- The document is encoded in UTF-8.
This is an issue if your file is encoded in, say, ISO-8859-1 (Latin-1).
- If using Notepad++, select Encoding | Convert to UTF-8 (but do not choose any of the BOM options).
- On Linux, use the command iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
- Using VBA/VB6, see How to convert VBA/VB6 Unicode strings to UTF-8.
Delete any byte-order mark (BOM) from the file. A byte-order mark for UTF-8 is the sequence of three bytes (0xEF,0xBB,0xBF) at the beginning of a file. This generally does not show up in a text editor, but it needs to be removed.
Note that a file consisting entirely of US-ASCII characters (all byte values < 128) is already UTF-8 encoded.
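The same conversion can be done programmatically. The sketch below assumes you already know the source encoding (here Latin-1 by default); the function name to_utf8_no_bom is hypothetical.

```python
import codecs

def to_utf8_no_bom(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Return the data as UTF-8 bytes without a byte-order mark.

    If the data starts with a UTF-8 BOM we assume it is already UTF-8 and
    just drop the three BOM bytes; otherwise we re-encode from source_encoding.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data.decode(source_encoding).encode("utf-8")

print(to_utf8_no_bom(b"Ol\xe1 mundo"))
```

Pure US-ASCII input passes through unchanged, in line with the note above.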
- Line breaks normalized to #xA on input, before parsing
We're looking here at converting "Windows" CR-LF line endings to "Unix" LF line endings (0x0A, or #xA in xml-ese).
If you've created your file on a Windows system, the chances are it has the wrong line endings.
- With Notepad++, select Edit | EOL Conversion | Unix (LF).
- On Linux systems use the utility
- If using a regex, do some variant of
- With Perl on Windows, perl -pe "binmode(STDOUT);s/\R/\012/" input.xml > output.xml (thanks to Perl Tricks)
- With C#, myString = myString.Replace("\r\n", "\n");.
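In Python, working on bytes, the line-ending conversion might look like the sketch below. The function name is our own. It also converts any lone CR to LF, which is what an XML parser does on input.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert CR-LF pairs, then any remaining lone CR, to a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

print(normalize_line_endings(b"<a>\r\n  <b>x</b>\r\n</a>"))
```

The order matters: replacing the CR-LF pairs first avoids turning one Windows line ending into two LF characters.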
- All whitespace in character content is retained (excluding characters removed during line feed normalization)
So don't change any whitespace between elements and whitespace inside an element.
However, we will be changing the whitespace for attributes inside a start tag.
An example to demonstrate some of the above. We wish to canonicalize the root Envelope element.
<?xml version="1.0" encoding="ISO-8859-1"?> <Envelope> <!-- some comment --> <Body> Olá mundo </Body> </Envelope>
<Envelope> <Body> Olá mundo </Body> </Envelope>
We strip off the XML declaration and any white space before or after the root <Envelope> element. We remove the comment, leaving the newline after it intact. Then we make changes to line endings and data encoding which are not obvious in a text editor. Let's examine the data bytes before and after c14n using hexdump.
Before: ISO-8859-1 encoding and Windows CR-LF line endings.
000000 3c 45 6e 76 65 6c 6f 70 65 3e 0d 0a 20 20 3c 42  <Envelope>..  <B
000010 6f 64 79 3e 0d 0a 20 20 20 20 4f 6c e1 20 6d 75  ody>..    Ol. mu
000020 6e 64 6f 0d 0a 20 20 3c 2f 42 6f 64 79 3e 0d 0a  ndo..  </Body>..
000030 0d 0a 3c 2f 45 6e 76 65 6c 6f 70 65 3e           ..</Envelope>
After: UTF-8 encoding and Unix (LF) line endings.
000000 3c 45 6e 76 65 6c 6f 70 65 3e 0a 20 20 3c 42 6f  <Envelope>.  <Bo
000010 64 79 3e 0a 20 20 20 20 4f 6c c3 a1 20 6d 75 6e  dy>.    Ol.. mun
000020 64 6f 0a 20 20 3c 2f 42 6f 64 79 3e 0a 0a 3c 2f  do.  </Body>..</
000030 45 6e 76 65 6c 6f 70 65 3e                       Envelope>
Note that
- Each CR-LF line ending 0d 0a has been replaced by LF 0a.
- The letter á is changed from Latin-1 encoding e1 to UTF-8 encoding c3 a1.
- All space characters (0x20) and newlines between the elements have been retained.
- The first character is "<" (U+003C) and the last character is ">" (U+003E).
- Empty elements are converted to start-end tag pairs
Change all empty-element tags of the form <foo/> to the start-end tag pair form <foo></foo>. Do not put any white space between the ">" and the "<". For example:
<DigestMethod Algorithm="http:...#sha1" />
becomes
<DigestMethod Algorithm="http:...#sha1"></DigestMethod>
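For a "straightforward" document, this rewrite can be sketched with a regular expression. The sketch assumes ASCII element names, no ">" or "<" characters inside attribute values, and no empty-element look-alikes inside CDATA sections. The name expand_empty_elements and the sample attribute value are hypothetical.

```python
import re

# <name ...attributes... /> where the attributes contain no "<" or ">".
EMPTY_TAG_RE = re.compile(rb"<([A-Za-z_][\w:.\-]*)([^<>]*?)\s*/>")

def expand_empty_elements(data: bytes) -> bytes:
    """Rewrite <foo/> as <foo></foo>, keeping the attributes as written."""
    return EMPTY_TAG_RE.sub(rb"<\1\2></\1>", data)

print(expand_empty_elements(b'<DigestMethod Algorithm="urn:example#sha1" />'))
```

The white space before the "/>" is dropped here, but the attributes themselves still need the normalization described in the next section.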
Attribute values are normalized.
Attribute value delimiters are set to quotation marks (double quotes).
Whitespace within start and end tags is normalized
Rewrite all the attributes so any white space inside a tag between attributes is replaced by a single space and all attribute values are surrounded by double quotes ("). This means no line breaks and exactly one space between attributes.
We normalize the attribute values by replacing any whitespace character - SPACE (0x20), CR (0x0d), LF (0x0a), TAB (0x09) - with a single SPACE character (0x20).
Character references are treated differently, see below. There are some other subtle normalization rules for attributes that are not CDATA (this should not affect you in practice with "straightforward" XML documents).
<e1 a='one' b = 'two' >
<e1 a="one" b="two">
<e2 C=' letter A ' >
<e2 C=" letter  A ">
In the second example above, the original attribute value for "C" is (SPACE)letter(CR)(LF)(TAB)A(SPACE). The canonicalized value is (SPACE)letter(SPACE)(SPACE)A(SPACE). The CR-LF pair should have already been reduced to a single LF character which is then normalized to a single space.
All space between the final double quote in a start tag and the closing ">" is removed, as are all spaces in a closing tag.
<e3 d= "foo" >bar</e3 >
<e3 d="foo">bar</e3>
We also need to sort the attributes in "lexicographic" order and deal with special characters in the attribute values. We'll discuss this below.
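The whitespace and quoting rules for a single start tag can be sketched as follows. This is a simplified illustration: it works on one start tag held in a string, does not sort the attributes, and does not deal with character references. The name normalize_start_tag is our own.

```python
import re

# name = "value" or name = 'value', with optional whitespace around "=".
ATTR_RE = re.compile(r"""([^\s=<>/]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

def normalize_start_tag(tag: str) -> str:
    """Rewrite one start tag with single spaces and double-quoted values."""
    name = re.match(r"<([^\s<>/]+)", tag).group(1)
    parts = [name]
    for attr_name, dq_value, sq_value in ATTR_RE.findall(tag[1 + len(name):]):
        value = dq_value or sq_value
        value = value.replace("\r\n", "\n")      # line-break normalization first
        value = re.sub(r"[\t\r\n]", " ", value)  # each whitespace char -> SPACE
        value = value.replace('"', "&quot;")     # safe inside double quotes
        parts.append(f'{attr_name}="{value}"')
    return "<" + " ".join(parts) + ">"

print(normalize_start_tag("<e1   a='one'\n   b  = 'two'  >"))
print(normalize_start_tag("<e2 C=' letter\r\n\tA ' >"))
```

The second call reproduces the (SPACE)letter(SPACE)(SPACE)A(SPACE) value discussed above: the CR-LF collapses to one LF, and the LF and TAB each become one space.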
- Namespaces are propagated down from any parent element.
In many cases, you are canonicalizing a subset (fragment) of a complete XML document.
In that case you must propagate down any namespaces in the parent element to the root element of the subset.
With the inclusive canonicalization we are doing, all namespaces are propagated down, even if the subset does not use them.
<?xml version="1.0" encoding="UTF-8"?> <Envelope xmlns="http://www.example.com"> <Part xmlns:ab="http://www.ab.com"> <Doc Id="P666"> ... </Doc> <Signature xmlns="http://www.w3.org/2000/09/xmldsig#"> <SignedInfo> <Reference URI="P666"> ... </SignedInfo> ... </Signature> </Part> </Envelope>
<Doc xmlns="http://www.example.com" xmlns:ab="http://www.ab.com" Id="P666"> ... </Doc>
<SignedInfo xmlns="http://www.w3.org/2000/09/xmldsig#" xmlns:ab="http://www.ab.com"> <Reference URI="P666"> ... </SignedInfo>
Note that the default namespace declared in the Signature element overrides the "example.com" one for the c14n'd SignedInfo element.
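One way to think about namespace propagation is as a merge of the declarations on each ancestor, outermost first, where a nearer declaration of the same prefix wins. The sketch below models the declarations as plain dictionaries that you fill in by hand; the function name and data layout are assumptions for illustration only.

```python
def in_scope_namespaces(ancestors, own):
    """Merge xmlns declarations, outermost ancestor first; the nearest wins.

    Keys are prefixes ("" for the default namespace), values are URIs.
    """
    merged = {}
    for declarations in ancestors:
        merged.update(declarations)
    merged.update(own)
    return merged

# Declarations taken from the example document above.
envelope = {"": "http://www.example.com"}
part = {"ab": "http://www.ab.com"}
signature = {"": "http://www.w3.org/2000/09/xmldsig#"}

doc_ns = in_scope_namespaces([envelope, part], {})
signed_info_ns = in_scope_namespaces([envelope, part, signature], {})
print(doc_ns)
print(signed_info_ns)
```

For the Doc subset both the "example.com" default and the "ab" prefix are propagated; for SignedInfo the default namespace comes from Signature instead.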
- Superfluous namespace declarations are removed from each element.
A superfluous namespace is one that has already been declared in a direct parent of a sub-element, and that parent is in scope for the c14n'd part.
In the following example, we are canonicalizing the entire Envelope element including the Signature. The "ab" namespace in the Doc element has already been declared with the same attribute value in the parent Part, so it is removed. Similarly, the "xmldsig" namespace in the SignedInfo element has already been declared in its parent, so it is removed.
<?xml version="1.0" encoding="UTF-8"?> <Envelope xmlns="http://www.example.com"> <Part xmlns:ab="http://www.ab.com"> <Doc Id="P666" xmlns:ab="http://www.ab.com"> ... </Doc> <Signature xmlns="http://www.w3.org/2000/09/xmldsig#"> <SignedInfo xmlns="http://www.w3.org/2000/09/xmldsig#"> <Reference URI="P666"> ... </SignedInfo> ... </Signature> </Part> </Envelope>
<Envelope xmlns="http://www.example.com"> <Part xmlns:ab="http://www.ab.com"> <Doc Id="P666"> ... </Doc> <Signature xmlns="http://www.w3.org/2000/09/xmldsig#"> <SignedInfo> <Reference URI="P666"> ... </SignedInfo> ... </Signature> </Part> </Envelope>
xmlns="". Just don't use it!
- Lexicographic order is imposed on the namespace declarations and attributes of each element
The c14n ordering of attributes is as follows.
- The default namespace declaration
xmlns="...", if any, comes first.
- Namespace declarations, sorted by prefix (the part after "xmlns:").
- Unqualified attributes, sorted by name.
- Qualified attributes, sorted by namespace URI then name.
a:attr="...", because we read this as
<e xmlns="http://example.org" xmlns:a="http://www.w3.org" xmlns:b="http://www.ietf.org" attr="I'm" attr2="all" b:attr="sorted" a:attr="out" a:attr2="now"></e>
For an excellent explanation of the rules to sort attributes when canonicalizing your data for XML-DSIG, see Keith S. Beattie's article on attribute ordering KSB's XML C14N Notes.
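The ordering rules above map neatly onto a pair of sort keys. In this sketch the attributes are given as (name, value) pairs and the prefix-to-URI map is supplied by hand; both function names are hypothetical. It assumes every prefix used on an attribute appears in that map (so the built-in "xml" prefix would need adding if you use it).

```python
def sort_namespace_decls(decls):
    """Sort {prefix: uri} by prefix; the default ("" prefix) comes first."""
    return sorted(decls.items(), key=lambda item: item[0])

def sort_attributes(attrs, prefixes):
    """Sort (name, value) pairs: unqualified by name, then by (URI, local name)."""
    def key(item):
        name = item[0]
        if ":" in name:
            prefix, local = name.split(":", 1)
            return (prefixes[prefix], local)
        return ("", name)  # the empty URI sorts before any real namespace URI
    return sorted(attrs, key=key)

prefixes = {"": "http://example.org", "b": "http://www.ietf.org", "a": "http://www.w3.org"}
attrs = [("a:attr2", "now"), ("b:attr", "sorted"), ("attr2", "all"),
         ("a:attr", "out"), ("attr", "I'm")]

print([p for p, _ in sort_namespace_decls(prefixes)])
print([n for n, _ in sort_attributes(attrs, prefixes)])
```

In the result b:attr comes before a:attr, because the comparison is on the namespace URIs ("http://www.ietf.org" sorts before "http://www.w3.org"), not on the prefixes.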
Character references are replaced.
Special characters in attribute values and character content are replaced by character references
A reminder. An XML character reference begins with "&#" and ends with a ";". These are meant to be used to input characters that aren't on your keyboard, or to enter characters that already have a meaning in XML, like "<" or "&".
For example, we could write &#225; to represent the letter á, or we could equivalently write &#xE1;.
The general rule: With a few exceptions (see below), all character references are changed to the actual UTF-8-encoded representation of the character.
So, for example, the character reference &#xE1; is replaced by the two bytes 0xc3 0xa1 (which should show as á in a UTF-8 compliant text editor). The character reference &#64; representing the "COMMERCIAL AT" symbol "@" is replaced by the byte 0x40, its UTF-8 encoding. The character reference &#x4E2D; for the Chinese character 中 (U+4E2D) is replaced by its UTF-8 encoding, the three bytes 0xE4 0xB8 0xAD.
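The general rule can be sketched as a regular-expression substitution followed by UTF-8 encoding. In this sketch, references to the handful of characters that need special treatment (&, <, >, the double quote, TAB, LF and CR) are deliberately left untouched, since they follow the exception rules rather than the general rule. The function name expand_char_refs is our own.

```python
import re

# Decimal (&#225;) or hexadecimal (&#xE1;) numeric character references.
CHAR_REF_RE = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")
# Characters covered by the exceptions; references to them are left as found.
SPECIAL = set('&<>"\t\n\r')

def expand_char_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""
    def replace(match):
        body = match.group(1)
        code = int(body[1:], 16) if body.startswith("x") else int(body)
        char = chr(code)
        return match.group(0) if char in SPECIAL else char
    return CHAR_REF_RE.sub(replace, text)

for ref in ("&#xE1;", "&#225;", "&#64;", "&#x4E2D;"):
    print(ref, expand_char_refs(ref).encode("utf-8").hex())
```

Encoding the result as UTF-8 gives the byte sequences quoted above for each character.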
Before we go on, let's just remind ourselves of some XML terminology:
<tag>content</tag>
<tag attribute-name="attribute-value">content</tag>
The "content" is the text between the opening tag and the closing tag of an element. An "attribute-value" is the text between the "delimiters" (quotes) in an attribute.
The exceptions: The exceptions are the five XML predefined entities (amp, lt, gt, apos, quot) and certain white space characters. The treatment is different depending on whether they are in element content or in an attribute value. Here's a summary of the main rules.
- In all cases, the character & is written as &amp; and the character < is written as &lt;.
- The single quote/apostrophe ' is always left as is, and the entity &apos; is always changed to ', encoded as the byte 0x27.
- The double quote " is left as is in element content (the byte 0x22), but is changed to &quot; in an attribute value.
- The greater-than symbol > is changed to &gt; in element content, but left as is in an attribute value (byte 0x3E).
- An isolated CR character (the byte 0x0d or the entity &#13; or equivalent) in element content or an attribute value is always replaced by the character reference &#xD; with the hexadecimal value "D" in uppercase and no leading zeros.
- The whitespace characters TAB (0x09) and LF (0x0A) in an attribute value are replaced by the character references &#x9; and &#xA;, respectively. But they are left as is in element content.
- The correct c14n form of those few character references left in is "uppercase hexadecimal with no leading zeros". So &#xD; is correct, but &#xd; and &#x0D; are not.
Make sure you have already changed your attribute value delimiters to double quotes before doing the above.
Hint: If you can, avoid using these messy whitespace characters other than a space in attribute values. In fact, for attribute values, try and avoid all the cases in the exceptions above.
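The exception rules can be summarized as two small escaping functions, one for element content and one for attribute values. This sketch assumes the input strings are the already-parsed values, that is, with entities and character references decoded to the actual characters. The function names are hypothetical.

```python
def escape_text(value: str) -> str:
    """Escape parsed element content for the canonical form."""
    return (value.replace("&", "&amp;")   # must come first
                 .replace("<", "&lt;")
                 .replace(">", "&gt;")
                 .replace("\r", "&#xD;"))

def escape_attr(value: str) -> str:
    """Escape a parsed attribute value (to be wrapped in double quotes)."""
    return (value.replace("&", "&amp;")   # must come first
                 .replace("<", "&lt;")
                 .replace('"', "&quot;")
                 .replace("\t", "&#x9;")
                 .replace("\n", "&#xA;")
                 .replace("\r", "&#xD;"))

print(escape_text('a < b & "c" > d'))
print(escape_attr('a > "b"'))
```

The ampersand is replaced first so that the "&" characters introduced by the later replacements are not escaped a second time. The apostrophe is never touched in either function.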
- CDATA sections are replaced with their character content
CDATA sections are used to escape blocks of text containing characters which would otherwise be recognized as markup, such as left angle brackets "<" and ampersands "&".
This makes typing, say, verbatim XML like the example below much less messy. To canonicalize it we put it back in its messy form.
<doc> <![CDATA[ <contact> <name>Fred Bloggs</name> <coy>Branston & Pickle</coy> </contact> ]]> </doc>
<doc> &lt;contact&gt; &lt;name&gt;Fred Bloggs&lt;/name&gt; &lt;coy&gt;Branston &amp; Pickle&lt;/coy&gt; &lt;/contact&gt; </doc>
Note the white space before the <![CDATA[ and after the ]]>. These are retained in the c14n transform.
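Replacing a CDATA section amounts to dropping its delimiters and escaping what was inside as ordinary element content. Here is a minimal sketch using a regular expression; it assumes the string "<![CDATA[" only ever appears as a genuine CDATA delimiter. The name replace_cdata is our own.

```python
import re

CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)

def replace_cdata(xml: str) -> str:
    """Replace each CDATA section with its escaped character content."""
    def escape(match):
        content = match.group(1)
        return (content.replace("&", "&amp;")
                       .replace("<", "&lt;")
                       .replace(">", "&gt;")
                       .replace("\r", "&#xD;"))
    return CDATA_RE.sub(escape, xml)

print(replace_cdata("<doc>\n<![CDATA[<coy>Branston & Pickle</coy>]]>\n</doc>"))
```

Anything outside the CDATA delimiters, including the surrounding newlines, is passed through unchanged.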
- Parsed entity references are replaced
<?xml version="1.0"?> <!DOCTYPE doc [ <!ENTITY ourname "DI Management"> ]> <doc>&ourname;</doc>
<doc>DI Management</doc>
- Default attributes are added to each element
You should not come across a default attribute in practice. They need to be specified in a DTD and you are not using a DTD, are you?
If you do and you have a default attribute that is not in the original, you need to add it.
<!DOCTYPE doc [<!ATTLIST e1 attr CDATA "default">]> <doc> <e1 /> </doc>
<doc> <e1 attr="default"></e1> </doc>
- Our new program SC14N, a straightforward XML canonicalization utility, performs the canonicalization (C14N) transformation you need to do when creating signed XML documents using XML-DSIG. That is, it does all the above for you automatically!
- The freeware Windows program hexdump is a simplified version of the Linux utility to display file contents in hexadecimal.
- digestvalue is a freeware command-line program that computes the digest value in base64 encoding of a file or list of files. The base64-encoded digest value is suitable for inserting in the <DigestValue> node of an XML-DSIG document, or can be used to compute the <SignatureValue>.
- Test your signed XML documents with Aleksey Sanin's XML Security Library Online XML Digital Signature Verifier.
- [XMLSPEC] Extensible Markup Language (XML) 1.0 (Fifth Edition) W3C Recommendation, 26 November 2008, <http://www.w3.org/TR/xml/>.
- [XML-C14N] RFC 3076 Canonical XML Version 1.0, March 2001, <https://tools.ietf.org/html/rfc3076>.
- [XML-DSIG] RFC 3275 XML-Signature Syntax and Processing, March 2002, <https://tools.ietf.org/html/rfc3275>.
- XML Signature WG
- XML-Signature Syntax and Processing <http://www.w3.org/TR/xmldsig-core/>
- Canonical XML Version 1.0, <http://www.w3.org/TR/2001/REC-xml-c14n-20010315/>
- Exclusive XML Canonicalization Version 1.0, <http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/>
For more information, or to comment on this page, please send us a message.
This page first published 28 June 2017. Last updated 16 July 2017.