Conquering Memory and Performance Problems With Large XML Treeviews
May 26, 2009
Xponent developed a radically new type of treeview that completely eliminates
the memory and performance problems that result from displaying a treeview with a large number of nodes. The architecture
does not implement or build upon any existing treeview technology, including ones that do not hold the entire XML in
memory. Conventional treeview controls hold the entire treeview in memory, such that the number of nodes that may be
added is strictly a function of the amount of available memory. As nodes are added and available memory decreases, so
does performance.
For developers reading this, we do not use a pre-built treeview control. Our control is a class that inherits from the Panel class
and adds to the panel a textbox for each XML node. So what appears to be a treeview is actually just a bunch of textboxes with plus and minus sign images for
expanding and collapsing elements. The number of textboxes added is very small and is the number that are visible in the panel at its current size,
typically twenty-five to fifty textboxes. Resizing the control by dragging the bottom frame down but not releasing it reveals there is no additional
XML content -it is not filled in with more XML nodes until the frame is released and the resize event is completed.
Existing Solutions
A widely used strategy, dynamic loading, has been around for at least ten years. It uses a conventional treeview control,
but loads only the nodes in the first hierarchy under the root(depth=1), but none of their child nodes, which are
dynamically loaded and unloaded upon demand when a parent node is expanded or collapsed. A problem with this design
occurs if an element has a large number of child nodes, because all of the child nodes have to be read to ensure that
each loaded element includes its end tag. This can significantly degrade performance. Similarly, expansion can take a
long time to read and load all the child nodes. As a demonstration,
download the simple.xml file in zip format(1.5MB zipped, 21 MB XML)
and try loading it into a treeview with whatever XML software you have. Simple.xml has just one child node under the root,
but it has one million elements(grandchildren).
Check out the data files from stackoverflow.com
referenced on our Links page. These files have tens of millions of elements all at the first
level under the root. XMLMax is the only XML editor that can open files this large.
Dynamic loading requires that the displayed elements are
initially collapsed since child nodes are not loaded. This may detract from the
usability of the treeview for some users. It also presents the problem of preserving the collapsed-expanded state of an
element; when a user collapses an element, all its child nodes are removed from the tree and the expanded
state of its descendant elements is lost. Thus, when the element is expanded again
all its child nodes are collapsed, even if some or all of them had ben left expanded.
Notepad and some other text editors display XML in a collapsable
tree. They do not use dynamic loading but also do not use an XML parser so they cannot detect if the file has an XML
syntax error. Given a big enough XML file
text editors will not use a treeview. Notepad++ took over one minute to read and fully load simple.xml, while XMLMax
took just three seconds.
The Ultimate Treeview
Xponent's treeview works by adding only the number of nodes it takes to fill the treeview panel, displaying them fully expanded.
It can display any fragment of an XML document even if the fragment contains unbalanced elements, i.e., not all the
elements have matching start tags and end tags.
It does not have to read the XML until it reaches the end of a balanced fragment. Thus, if the treeview window
displays twenty-five XML nodes, then twenty-five nodes are read from the XML and added to the tree which is then ready for
user interaction.
When an XML file is first opened, the entire file is parsed(with the .Net XmlReader). The user is able to navigate the treeview for the
portion of the XML that has been read while the remainder of the file is read in the background(asynchronously).
During this initial parsing, the file is segmented at arbitrary elements(the segments do not need to be balanced as
mentioned above) and only one segment is held in memory.
Because so few nodes are in the treeview at a given time, the
amount of memory needed to display and navigate any XML is strictly a factor of the content of the XML.
Approximately ten thousand XML nodes are held in a memory buffer. Thus, the memory requirement is the size in bytes of the XML
contained in the buffer, plus a small amount of overhead. It would be unusual for the memory requirement to exceed
twenty megabytes unless the XML content includes very large text nodes. Each time the tree is navigated, such as page
down, page up, each textbox is updated with the next set of nodes. This completely
eliminates both the memory and performance problems, but it introduces a number of new ones.
Even with such a custom treeview, Xponent had to resolve a number of issues:
- Locating the child nodes of an element when it is expanded or collapsed.
- Tracking the expansion state of each element such that it is re-displayed in its prior state.
- Tracking the hierarchical depth of each node so that it may be properly indented in the treeview.
- Fast, bi-directional paging and scrolling from any location in the XML.
XML data present the addtional problem of determining when XML nodes need to be added to the treeview in response to
user navigation.
To address these issues, Xponent designed a buffering and caching system based on accessing and tracking byte offsets
into the raw XML. The only drawbacks of the design are:
- There is no smooth scrolling while the scrollbar is dragged.
This is because the treeview contains only the number of XML nodes that fit on one page. In the case of a scroll drag event, it cannot
be re-populated until the drag event is completed.
- There may be a brief delay during user navigation when the end of a buffer is reached.
This is because the buffer must be cleared and re-loaded. The delay ranges from a small fraction of one second to two
seconds. The delay for a given datasource is constant and does not vary with how far the treeview is scrolled. Thus,
dragging the thumb from the beginng of the tree to the end should not take more than two seconds. Paging and scrolling
within the range of the current buffer takes a small fraction one second so there is no noticeable delay. Buffer size
ranges from ten to fifty thousand nodes.