Xponent's Mostly XML Blog

XML Parsers And Well-Formed Errors

The W3C XML 1.0 Specification requires an XML document to be "well-formed", which basically means that its syntax is correct. This article addresses how an XML parser locates well-formed errors. The list of syntax rules is rather lengthy. Some of the basic rules are that the document must have a single root element, that element tags must be properly nested, and that tag names are case sensitive. The W3C (www.w3c.org) is the standards-setting organization that developed XML and related specifications.

Why Browsers and XML Editors Stop When They Hit A Well-formed Error

Why not read the entire document and then report all the errors? The structure of XML makes it impossible to accurately read any further in the document once an error has been encountered. While reading this article, keep in mind that an XML parser, which is the software that reads and analyzes the XML, reads in a forward direction only. It does not retain much information about what it has already read, primarily for memory and performance reasons. The XML parser error messages reported here are from the XmlReader class in Microsoft's .NET Framework.

Parsing XML is somewhat like walking down a staircase in the dark. One carefully takes a step at a time, checking to see if the next step is broken or missing. Any attempt to step over a broken or missing step could be disastrous. It is simply too dangerous to continue until the step is repaired. When an XML parser encounters a well-formed error, there is no way to reliably evaluate what lies ahead. Consider the following XML, bearing in mind that XML tag names may not begin with a digit, so 1Foo is a well-formed error.

example1

Most XML parsers report the line and character position of the offending character when a well-formed error is encountered. In the example above, the error would be reported as:

Name cannot begin with the '1' character, hexadecimal value 0x31. Line 2, position 2.

If a parser were to report the error and proceed, how would it know where the 1Foo element ends? The end tag may have the same error, or it may have a different error, or it may be missing. Should the parser assume the Foo element is meant to be the end tag for 1Foo, or assume it is the end tag of another element that is missing its start tag? Such assumptions simply cannot be made, as they could lead to cascading misinterpretations and to the reporting of additional errors that may or may not be real errors.
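The stop-at-the-first-error behavior can be observed with almost any parser. The following is a minimal sketch using Python's built-in xml.parsers.expat module rather than the .NET XmlReader quoted in this article, so the wording of the message will differ. The sample document and the function name first_error are assumptions made for illustration; they are not the original example.

```python
import xml.parsers.expat

# Hypothetical document, not the article's original sample.
# The name 1Foo on line 2 starts with a digit, which XML forbids.
# The end tag </Foo> would not match either, but it is never reached.
DOC = "<root>\n<1Foo>text</Foo>\n</root>"


def first_error(xml_text):
    """Return (message, line, column) for the first well-formed error,
    or None if the document parses cleanly."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        # Parsing stops here; nothing after this point is examined.
        return str(err), err.lineno, err.offset
    return None


print(first_error(DOC))
```

Only one error is ever returned. The parser raises an exception at the bad name and gives up, so the second problem on the same line goes unreported until the first one is fixed.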

Locating Well-formed Errors

Consider the following XML file:

example1

Most XML parsers would report an error similar to the following:

End tag 'allbooks' does not match the start tag 'book'. Line 5, Position 3.

Notice that the reported line is that of the root end tag. The second book tag was read as the start of another element, and the parser reached the root end tag without finding an end tag for it. In this small file it is easy to see that the problem may be fixed by changing the second book start tag into an end tag. But the error message would not be much help in a large XML file where there may be thousands of book elements and the error occurs in the middle of the document.
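The gap between the reported line and the line that needs fixing can be reproduced with a short script. This is only a sketch: the five-line file below is a hypothetical stand-in for the example above, the helper name reported_line is invented here, and Python's xml.parsers.expat is used in place of XmlReader.

```python
import xml.parsers.expat

# Hypothetical five-line file; line 4 should have been </book>.
DOC = (
    "<allbooks>\n"
    "  <book>\n"
    "    <title>XML Basics</title>\n"
    "  <book>\n"
    "</allbooks>\n"
)


def reported_line(xml_text):
    """Return (line number, text of that line) where the parser gives up,
    or None if the document is well-formed."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, xml_text.splitlines()[err.lineno - 1]
    return None


print(reported_line(DOC))
```

The parser points at the line holding the root end tag, not at line 4 where the slash is missing.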

Even more problematic is an XML file containing no carriage returns or line feeds such as the following:

example1

The last element before the root end tag could either be missing markup to make it an end tag for element "e", or it could be an element without an end tag. An XML parser will report a well-formed error. The line number will be one since it is a single-line file, but the character position reported will be that of the root end tag, regardless of where the actual problem tag is in the file. Thus, the problem tag cannot be located from the parser's error message alone. The XML parser reports an error like the following:

The 'e' start tag on line 1 does not match the end tag of 'root'. Line 1, position 50.

Position 50 is the root end tag. The parser does not realize that the last element is missing its end tag until it has reached the end tag of the root, which it does recognize as the last element in the document, so it stops there. What do you suppose the parser reports if it is the second element that is missing its end tag, rather than the last one? The error is reported at the identical position, because the parser still does not realize an error exists until it reaches the root end tag. An XML parser does not remember much and cannot look back. It does not assume that a start tag should have been an end tag and report the error at that point. Instead it treats the tag as the start of a child element and continues reading until it reaches the end.
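That claim is easy to check. The sketch below feeds two hypothetical single-line documents of equal length to Python's xml.parsers.expat parser (again standing in for XmlReader): one is missing the slash on its second element, the other on its last element. The document contents and the name error_position are assumptions for illustration.

```python
import xml.parsers.expat


def error_position(xml_text):
    """Return (error code, line, column) where parsing stops, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        return err.code, err.lineno, err.offset
    return None


# Hypothetical documents: same length, different broken tag.
SECOND_BROKEN = "<root><a>1</a><b>2<b><c>3</c></root>"  # <b> should be </b>
LAST_BROKEN = "<root><a>1</a><b>2</b><c>3<c></root>"    # <c> should be </c>

print(error_position(SECOND_BROKEN))
print(error_position(LAST_BROKEN))
print(error_position(SECOND_BROKEN) == error_position(LAST_BROKEN))
```

Both documents produce the same error code at the same position, the root end tag, even though the faulty tags are in different places.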

A human can easily see where the problem is because this is a tiny XML document. What if the XML had a few thousand elements? Even if viewed in a tree, how long would it take to locate the error if the problem tag is in the middle of a huge XML file?

As far as I know, no XML parser reports the location of the problem tag in this particular scenario. If you know of an XML parser that does, please advise. This does not preclude applications that use an XML parser from implementing their own strategy to accomplish this. It would require a method of tracking element tags along with their positions.
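One way an application might track element tags along with their positions is sketched below. It is an illustration under stated assumptions, not a finished tool: it uses Python's xml.parsers.expat, the function name open_elements_at_error and the sample document are invented here, and it only lists the start tags that were still open when the parser stopped. Deciding which of those tags is the real culprit is still left to a person or to further heuristics.

```python
import xml.parsers.expat


def open_elements_at_error(xml_text):
    """Return (error, stack). The stack lists (name, line, column) for
    every start tag still open when the parser stopped."""
    stack = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # Inside a handler, these report where the current tag begins.
        # Columns are zero-based, so add 1 for a human-friendly position.
        stack.append(
            (name, parser.CurrentLineNumber, parser.CurrentColumnNumber + 1)
        )

    def on_end(name):
        # Only called for end tags that match the innermost open element.
        stack.pop()

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        return err, list(stack)
    return None, []


# Hypothetical single-line document; the second <b> should be </b>.
DOC = "<root><a>1</a><b>2<b><c>3</c><d>4</d></root>"

error, still_open = open_elements_at_error(DOC)
print(error)
for name, line, column in still_open:
    print(f"unclosed <{name}> opened at line {line}, position {column}")
```

For this document the list contains root and two b elements. Two same-named start tags nested directly inside each other are a strong hint about where the slash went missing, and the recorded positions lead straight to them even in a file with no line breaks.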

Submitted by Bill Conniff, Founder of Xponent, on February 10, 2010