Xponent's Mostly XML Blog

Removing Illegal Characters in XML Documents

The W3C XML 1.0 specification defines the set of characters that may appear in an XML document. This article explains the meaning of this rule and provides C# methods that locate illegal characters. To begin with, the following table lists the ranges of valid XML characters. Any character outside these ranges is not allowed.

Hexadecimal	Decimal
#x9	#9
#xA	#10
#xD	#13
#x20-#xD7FF	#32-#55295
#xE000-#xFFFD	#57344-#65533
#x10000-#x10FFFF	#65536-#1114111
That is, any Unicode character, excluding the surrogate blocks, FFFE, FFFF, and the control characters below #x20 other than tab, line feed and carriage return.
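The ranges in the table translate directly into a range check. The following is a minimal sketch, not code from the original article; the class name XmlCharRanges and the method name IsLegalXmlChar are hypothetical and chosen only for illustration. It assumes the caller has already decoded the input into a Unicode code point.

```csharp
using System;

public static class XmlCharRanges
{
    // Returns true when codePoint falls inside one of the ranges
    // listed in the table of valid XML 1.0 characters.
    public static bool IsLegalXmlChar(int codePoint)
    {
        return codePoint == 0x9                                   // tab
            || codePoint == 0xA                                   // line feed
            || codePoint == 0xD                                   // carriage return
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

For example, IsLegalXmlChar(0x41) is true for the letter A, while IsLegalXmlChar(0x0) and IsLegalXmlChar(0xFFFE) are both false.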

CDATA sections are a partial exception. Markup is not recognized inside them, so the left angle bracket and the ampersand may appear in the content of CDATA in their literal form. Their escaped forms, such as &lt; and &amp;, are not expanded there and are read as literal text. The one sequence that cannot appear is two right square brackets followed by the greater than symbol, "]]>", because it marks the end of a CDATA section. The character ranges listed above still apply inside CDATA: a character that is illegal elsewhere in the document is illegal there too.

Certain other characters are commonly referred to as illegal XML characters, and this has led to some misunderstanding. The less than symbol < is allowed only as part of markup, such as the start of an XML tag. The ampersand symbol & is allowed only when it begins a character reference or an entity reference (either one of the five pre-defined XML entities or an entity that has been declared in a Document Type Definition (DTD)). Since there are accepted uses for these two characters, they are not, strictly speaking, illegal XML characters. The less than and ampersand characters are two of the five characters covered by the pre-defined XML entities. The other three are the greater than symbol, the quote and the apostrophe, each of which is allowed in XML content without being expressed in entity notation. For example, <element> 1 > 2</element> is legal. XML processors are required to convert the pre-defined entities to their character representation even when they are not declared anywhere in the XML document.
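To illustrate the distinction, the sketch below escapes only the two characters that cannot appear literally in element content and leaves the greater than symbol, the quote and the apostrophe untouched. The class name XmlTextEscaper and the method name EscapeText are hypothetical, and the sketch assumes the text is destined for element content (not an attribute value) and does not contain the sequence "]]>".

```csharp
using System.Text;

public static class XmlTextEscaper
{
    // Replaces '<' and '&' with their pre-defined entity references.
    // All other characters are copied through unchanged.
    public static string EscapeText(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (c)
            {
                case '<': sb.Append("&lt;"); break;
                case '&': sb.Append("&amp;"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, EscapeText("1 > 2 & 3 < 4") returns "1 > 2 &amp; 3 &lt; 4".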

Now that it is clear which characters are illegal in XML, let's move on to handling illegal characters when they do occur in an XML document. A Google search for "remove illegal XML characters" returns plenty of code snippets. While most that I looked at appear to work, they all pass the XML as a string to a function that checks whether the string contains an illegal XML character. That is fine for small XML documents, but for large documents I always read the file byte by byte, which in my experience is orders of magnitude faster.

Two C# methods appear below. They are designed to be called from an application that reads the XML document using a FileStream object and sequentially reads chunks of the file into a byte array.
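The calling pattern described above might look like the following minimal sketch. It is an illustration only, not the author's application: the class name ChunkedReader, the method name ScanStream and the onByte callback are hypothetical. It accepts any Stream so that a FileStream can be passed in directly.

```csharp
using System;
using System.IO;

public static class ChunkedReader
{
    // Reads the stream sequentially in chunks of bufferSize bytes and calls
    // onByte(offset, value) for every byte. Returns the total bytes read.
    public static long ScanStream(Stream stream, int bufferSize, Action<long, int> onByte)
    {
        byte[] buffer = new byte[bufferSize];
        long offset = 0;
        int read;
        // Stream.Read returns 0 once the end of the stream is reached.
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                onByte(offset + i, buffer[i]);
            }
            offset += read;
        }
        return offset;
    }
}
```

A caller would open the file with something like new FileStream(path, FileMode.Open, FileAccess.Read) inside a using statement and pass it to ScanStream along with a buffer size such as 64 * 1024. A real application also has to handle a character reference that straddles two chunks, which this sketch leaves out.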

The first method is IllegalChars and has three parameters: a byte array, the index in the array where an ampersand occurs and a boolean value indicating whether the XML file is Unicode. It is called when the application reading the byte array encounters an ampersand. IllegalChars tests the array at indexOfAmpersand for a decimal or hexadecimal value formatted as a numeric character reference, such as &#x20; or &#32;. When IllegalChars returns, the calling method can take appropriate action, such as reporting the problem or replacing the illegal character with a legal string, such as an underscore. Not included is code that detects an illegal occurrence of either the less than symbol or the ampersand. The reason is that these can only be detected accurately by a fully compliant XML parser.

code1
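What follows is not the original listing but a minimal sketch of how a method with these three parameters could work. Several details are assumptions: the boolean return type, the reading of isUnicode as UTF-16 little-endian (two bytes per character, low byte first), and the helper names ReadChar and DigitValue. A reference that is cut off by the end of the buffer is reported as false, so the caller would need to handle chunk boundaries separately.

```csharp
using System;

public static class CharRefChecker
{
    // Returns true when the buffer holds, at indexOfAmpersand, a complete
    // numeric character reference (&#N; or &#xH;) whose value is not a
    // legal XML character. Returns false in every other case.
    public static bool IllegalChars(byte[] buffer, int indexOfAmpersand, bool isUnicode)
    {
        int step = isUnicode ? 2 : 1;        // assumption: UTF-16LE when isUnicode
        int i = indexOfAmpersand + step;     // skip the ampersand itself
        if (ReadChar(buffer, i, isUnicode) != '#') return false;
        i += step;

        int radix = 10;
        int c = ReadChar(buffer, i, isUnicode);
        if (c == 'x') { radix = 16; i += step; }

        long value = 0;
        int digits = 0;
        while (true)
        {
            c = ReadChar(buffer, i, isUnicode);
            int d = DigitValue(c, radix);
            if (d < 0) break;
            value = value * radix + d;
            if (value > 0x10FFFF) value = 0x110000; // clamp: avoids overflow, stays illegal
            digits++;
            i += step;
        }
        if (digits == 0 || c != ';') return false;  // not a complete reference
        return !IsLegalXmlChar((int)value);
    }

    // Reads one character, or returns -1 past the end of the buffer.
    static int ReadChar(byte[] buffer, int i, bool isUnicode)
    {
        if (isUnicode)
        {
            if (i + 1 >= buffer.Length) return -1;
            return buffer[i] | (buffer[i + 1] << 8);
        }
        return i < buffer.Length ? buffer[i] : -1;
    }

    // Returns the numeric value of a digit in the given radix, or -1.
    static int DigitValue(int c, int radix)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (radix == 16)
        {
            if (c >= 'a' && c <= 'f') return c - 'a' + 10;
            if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        }
        return -1;
    }

    static bool IsLegalXmlChar(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Given the ASCII bytes of "&#x1;" and an index of 0, this sketch returns true. For "&#32;" or "&amp;" it returns false.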

The second method, IllegalByte, has one parameter, currentByte, the integer value of the byte just read by the FileStream object. This method checks whether currentByte is within the range of allowed XML character values and returns zero if it is. If it is not a legal character, it returns one and the calling program can take whatever action is appropriate to the application.

code2
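What follows is not the original listing but a minimal sketch of a method with this signature and return convention. It assumes an ASCII-compatible encoding such as UTF-8, where a single byte can only reveal an illegal control character; bytes of 0x80 and above belong to multi-byte sequences and are passed through. The class name ByteChecker is hypothetical.

```csharp
public static class ByteChecker
{
    // Returns 0 when currentByte can be part of a legal XML character,
    // and 1 when it is a control character that XML 1.0 does not allow.
    // Callers should stop at end of stream (FileStream.ReadByte returns -1)
    // before calling; a negative value is reported as illegal here.
    public static int IllegalByte(int currentByte)
    {
        if (currentByte == 0x9 || currentByte == 0xA || currentByte == 0xD)
        {
            return 0; // tab, line feed and carriage return are allowed
        }
        return currentByte >= 0x20 ? 0 : 1;
    }
}
```

For example, IllegalByte(0x41) returns 0 and IllegalByte(0x0) returns 1. A byte-level check like this cannot detect encoded surrogates, FFFE or FFFF; that would require decoding the multi-byte sequence first.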

Submitted by Bill Conniff, Founder of Xponent, on October 4, 2010