Removing Illegal Characters in XML Documents
The W3C XML 1.0 specification defines the set of characters that may appear in an XML document. This article explains what that rule means
and describes two C# methods that locate illegal characters. To begin, the following table lists the ranges of
valid XML characters. Any character outside these ranges is not allowed.
Hexadecimal | Decimal |
#x9 | #9 |
#xA | #10 |
#xD | #13 |
#x20-#xD7FF | #32-#55295 |
#xE000-#xFFFD | #57344-#65533 |
#x10000-#x10FFFF | #65536-#1114111 |
That is: tab, line feed, carriage return, and any Unicode character from #x20 upward, excluding the surrogate blocks, FFFE, and FFFF.
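The ranges above translate directly into a range check. The following is a minimal sketch, not the method from this article; the class name XmlCharRange and the method name IsLegalXmlChar are my own illustrative choices, and the argument is assumed to be a full Unicode code point rather than a single byte or UTF-16 code unit.

```csharp
public static class XmlCharRange
{
    // Returns true if codePoint falls inside one of the ranges of
    // valid XML 1.0 characters listed in the table.
    // Assumption: codePoint is a complete Unicode code point.
    public static bool IsLegalXmlChar(int codePoint)
    {
        return codePoint == 0x9                                  // tab
            || codePoint == 0xA                                  // line feed
            || codePoint == 0xD                                  // carriage return
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

For example, XmlCharRange.IsLegalXmlChar(0x41) is true for the letter A, while XmlCharRange.IsLegalXmlChar(0x1F) and XmlCharRange.IsLegalXmlChar(0xFFFE) are both false.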
A CDATA section is not an exception to this rule: its content must still consist only of the characters listed above. It also may not contain the sequence of
two right square brackets followed by the greater than symbol "]]>", because that sequence marks the end of the CDATA section.
What a CDATA section does change is that the left angle bracket and the ampersand may appear in its content in their literal form; escaped forms such as &lt; are not expanded there and are treated as plain text.
Certain other characters are commonly referred to as illegal XML characters, and this has led to some
misunderstanding. The less than symbol < is allowed only as part of XML markup, such as the start of a tag. The ampersand
symbol & is allowed only when it begins a reference (one of the five pre-defined XML entities, an entity
declared in a Document Type Definition (DTD), or a numeric character reference). Since there are accepted uses for these two
characters, they are not, strictly speaking, illegal XML characters. The less than symbol and the ampersand are two
of the five characters that have pre-defined XML entities. The other three are the greater than symbol, the quote and the apostrophe,
each of which is allowed in XML content without being expressed in entity notation.
For example, <element> 1 > 2</element> is legal.
XML processors are required to recognize the pre-defined entities and convert them to
their character representation, even when they are not declared anywhere in the XML document.
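These rules can be checked against a real parser. The sketch below is only an illustration: it uses the standard XDocument.Parse method from System.Xml.Linq, while the class name EntityDemo and the helper TryGetRootText are hypothetical names I made up for this example.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class EntityDemo
{
    // Parses the given XML and returns the text content of the root element,
    // or null if the parser rejects the input as not well-formed.
    public static string TryGetRootText(string xml)
    {
        try
        {
            return XDocument.Parse(xml).Root.Value;
        }
        catch (XmlException)
        {
            return null;
        }
    }
}
```

Called with "<element> 1 > 2</element>", the helper returns the text " 1 > 2", because a literal greater than symbol is allowed in content. Called with "<element> 1 < 2</element>", it returns null, because the literal less than symbol is not markup. Called with "<element>1 &lt; 2 &amp; 3</element>", it returns "1 < 2 & 3", which shows the pre-defined entities being converted without any declaration.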
Now that it is clear which characters are illegal in XML, let's move on to handling illegal
characters when they do occur in an XML document. A Google search for "remove illegal XML characters" returns
plenty of code snippets. While most of those I looked at appear to work, they all pass the XML as a string to a function that
checks whether the string contains an illegal XML character. That is fine for small XML documents, but for large documents
I always read the file byte by byte, which in my experience is orders of magnitude faster.
Two C# methods are described below. They are designed to be called from an application that opens the XML document
with a FileStream object and sequentially reads chunks of the file into a byte array.
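The calling pattern just described might look roughly like the following. This is a sketch under stated assumptions, not the application the methods were written for: the class name ChunkedScan, the method name CountIllegalControlBytes and the 64 KB default chunk size are all my own choices, and the file is assumed to be in a single-byte encoding or UTF-8, so only control bytes below 0x20 are examined.

```csharp
using System.IO;

public static class ChunkedScan
{
    // Reads the stream in chunks and counts bytes below 0x20 that are not
    // tab, line feed or carriage return.
    // Assumption: single-byte encoding or UTF-8; bytes of 0x80 and above
    // are not validated here.
    public static long CountIllegalControlBytes(Stream input, int chunkSize = 64 * 1024)
    {
        byte[] buffer = new byte[chunkSize];
        long count = 0;
        int read;
        // Read returns the number of bytes placed in the buffer, or 0 at end of stream.
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];
                if (b < 0x20 && b != 0x9 && b != 0xA && b != 0xD)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The method takes a Stream rather than a path so that it can be tried with a MemoryStream. For a file, it would be called as CountIllegalControlBytes(new FileStream(path, FileMode.Open, FileAccess.Read)) inside a using statement, where path is whatever file the application is checking.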
The first method is IllegalChars and has three parameters: a byte array, the index in the array where an
ampersand occurs, and a boolean value indicating whether the XML file is Unicode. It is called when the application
reading the byte array encounters an ampersand. IllegalChars tests the array at indexOfAmpersand for the
existence of a numeric character reference, in either decimal form (&#NNN;) or hexadecimal form (&#xHHH;), that refers to an illegal character.
When IllegalChars returns, the calling method can take
appropriate action, such as reporting the problem or replacing the illegal character with a legal string,
such as an underscore. Code that detects an illegal occurrence of either the less than
symbol or the ampersand is not included, because such occurrences can only be accurately detected by a fully compliant
XML parser.
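A method along the lines of IllegalChars could be sketched as follows. This is a reconstruction from the description only, so several details are assumptions: the bool return type, the parameter name buffer, the helper methods, the class name XmlReferenceCheck, and the reading of "Unicode" as UTF-16 little-endian with two bytes per character. A reference that is cut off by the end of the chunk is reported as not illegal here, and the caller would need to handle that case.

```csharp
public static class XmlReferenceCheck
{
    // Returns true if a numeric character reference starting at indexOfAmpersand
    // (for example &#31; or &#x1F;) names a character that is illegal in XML 1.0.
    // Assumption: when isUnicode is true the buffer holds UTF-16LE text.
    public static bool IllegalChars(byte[] buffer, int indexOfAmpersand, bool isUnicode)
    {
        int step = isUnicode ? 2 : 1;
        int i = indexOfAmpersand + step;
        if (ReadAscii(buffer, i, isUnicode) != '#') return false; // named entity, not numeric
        i += step;

        int radix = 10;
        int c = ReadAscii(buffer, i, isUnicode);
        if (c == 'x') { radix = 16; i += step; }

        long value = 0;
        int digits = 0;
        while (true)
        {
            c = ReadAscii(buffer, i, isUnicode);
            int d = DigitValue(c, radix);
            if (d < 0) break;
            value = value * radix + d;
            if (value > 0x10FFFF) value = 0x110000; // clamp: already out of range
            digits++;
            i += step;
        }

        // Malformed or truncated references are left to the XML parser.
        if (digits == 0 || c != ';') return false;
        return !IsLegalXmlChar((int)value);
    }

    // Returns the character at index if it is in bounds and ASCII-compatible, else -1.
    private static int ReadAscii(byte[] buffer, int index, bool isUnicode)
    {
        if (isUnicode)
        {
            if (index < 0 || index + 1 >= buffer.Length || buffer[index + 1] != 0) return -1;
            return buffer[index];
        }
        if (index < 0 || index >= buffer.Length) return -1;
        return buffer[index];
    }

    // Returns the numeric value of a digit in the given radix, or -1 if c is not a digit.
    private static int DigitValue(int c, int radix)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (radix == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (radix == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Range check for the valid XML 1.0 characters.
    private static bool IsLegalXmlChar(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

In this sketch the caller decides what to do when the method returns true, for example logging the offset or overwriting the reference with an underscore.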
The second method, IllegalByte, has one parameter, currentByte, the integer value of the byte just read by
the FileStream object. This method checks whether currentByte is within the range of allowed XML character
values and returns zero if it is. If it is not a legal character, it returns one and the
calling program can take action appropriate to the application.
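A method matching this description of IllegalByte could be as small as the sketch below. It is an illustration under assumptions rather than the original listing: the class name XmlByteCheck is mine, and the file is assumed to be in a single-byte encoding or UTF-8, so any byte of 0x20 or above is accepted and multi-byte sequences are not decoded.

```csharp
public static class XmlByteCheck
{
    // Returns 0 if currentByte is an allowed XML character value, 1 otherwise.
    // Assumption: single-byte encoding or UTF-8; bytes of 0x80 and above are
    // treated as legal and are not decoded into code points.
    public static int IllegalByte(int currentByte)
    {
        if (currentByte == 0x9 || currentByte == 0xA || currentByte == 0xD)
        {
            return 0; // tab, line feed, carriage return
        }
        if (currentByte >= 0x20)
        {
            return 0;
        }
        return 1; // remaining control bytes, and any negative value
    }
}
```

FileStream.ReadByte returns -1 at the end of the stream, so the caller should test for that value before calling IllegalByte; the sketch would otherwise report it as illegal.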
Submitted by Bill Conniff, Founder of Xponent, on October 4, 2010