Introduction to VTD-XML
VTD-XML is a new, open-source, non-validating, non-extractive XML processing API written in Java. VTD-XML is a good alternative to Simple API for XML (SAX) and Document Object Model (DOM), as it does not force you to trade processing performance for usability.
Editor's note: This article was contributed by Jeenus Johnson.
The Java-based, non-validating VTD-XML parser is faster than DOM and easier to use than SAX. Unlike other XML processing technologies, VTD-XML is designed for random access without incurring excessive resource overhead.
An important optimization feature of VTD-XML is non-extractive tokenization. Internally, VTD-XML keeps the XML message in memory, intact and undecoded, and represents each token exclusively by its starting offset and length. Tokenization in VTD-XML is based on the Virtual Token Descriptor (VTD) binary encoding specification. A VTD record is a 64-bit integer that encodes the length, starting offset, type, and nesting depth of a token in the XML document.
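To make the idea of a fixed-width record concrete, here is a minimal sketch of packing a token's type, depth, length, and starting offset into a single 64-bit long, then recovering the token text from the untouched document bytes only when it is asked for. The class name VtdRecordSketch and the field widths (4, 8, 20, and 32 bits) are assumptions made for this illustration; the VTD specification defines the authoritative layout.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the field widths are assumed for this sketch,
// not copied from the VTD specification.
final class VtdRecordSketch {

    // Pack four token properties into one 64-bit record.
    static long pack(int type, int depth, int length, int offset) {
        if (type < 0 || type > 0xF) throw new IllegalArgumentException("type");
        if (depth < 0 || depth > 0xFF) throw new IllegalArgumentException("depth");
        if (length < 0 || length > 0xFFFFF) throw new IllegalArgumentException("length");
        if (offset < 0) throw new IllegalArgumentException("offset");
        return ((long) type << 60)
             | ((long) depth << 52)
             | ((long) length << 32)
             | (offset & 0xFFFFFFFFL);
    }

    static int type(long record)   { return (int) (record >>> 60); }
    static int depth(long record)  { return (int) ((record >>> 52) & 0xFF); }
    static int length(long record) { return (int) ((record >>> 32) & 0xFFFFF); }
    static int offset(long record) { return (int) (record & 0xFFFFFFFFL); }

    // Non-extractive: the token text is decoded from the original bytes on demand.
    static String text(byte[] doc, long record) {
        return new String(doc, offset(record), length(record), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<order id=\"42\"/>".getBytes(StandardCharsets.UTF_8);
        // Hypothetical record for the element name: type 0, depth 0, 5 bytes at offset 1.
        long record = pack(0, 0, 5, 1);
        System.out.println(text(doc, record));
    }
}
```

Because every record is a plain long, a whole document's tokens can live in one long[] array, which is what makes the bulk allocation described next possible.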
Because the records are constant in length, memory buffers can be allocated in bulk to store them. This avoids creating the multitude of string and node objects usually associated with other XML processing technologies. As a result, both memory usage and object creation cost are greatly reduced, which leads to significantly higher processing performance. For example, on a 1.5 GHz Athlon machine, VTD-XML delivers random access at 25 to 35 MB/sec, outperforming most SAX parsers with null content handlers. An in-memory VTD-XML document typically consumes only 1.3 to 1.5 times the size of the XML document.
VTD-XML provides several benefits for software developers. Suppose you are starting a project that involves XML and need to choose a processing model. DOM is slow and consumes too much memory, particularly for large documents. SAX is difficult to use, especially for XML documents with complex structures.
VTD-XML offers an alternative that does not force you to trade processing performance for usability: you get random access without giving up speed. Although SAX is fast because of its forward-only nature, it does not suit every situation.
In some situations, you have to do a lot of buffering to extract the data you need, while in others, you may have to run the SAX parser over the same document multiple times. Either way, SAX programming often results in ugly, hard-to-maintain code, while its performance benefit over DOM is not always significant. VTD-XML lets you achieve ease of use and high performance at the same time, and its performance benefit over DOM is substantial.
VTD-XML is suitable for an XML project only if two criteria are met. The first criterion is that your documents must not depend on entity declarations in document type definitions (DTDs), because the current version of VTD-XML does not support them. VTD-XML recognizes only the five built-in entities: &amp;, &apos;, &lt;, &gt;, and &quot;. VTD-XML works well with XML formats such as Simple Object Access Protocol (SOAP), Resource Description Framework (RDF), Financial Information Exchange Markup Language (FIXML), and Really Simple Syndication (RSS). The second criterion is that you have sufficient RAM: to provide true random access, the entire document must be held in memory, and VTD-XML's internal parsed representation is slightly larger than the XML itself. When both criteria are met, VTD-XML is the most efficient XML processing API.
The Java API of VTD-XML consists of three essential components:
- VTDGen (VTD generator) encapsulates the parsing routine that produces the internal parsed representation of XML.
- VTDNav (VTD navigator) is a cursor-based API that allows DOM-like random access to the hierarchical structure of XML.
- AutoPilot is the class that allows document-order element traversal.
The following steps need to be performed to use VTD-XML for processing an XML document either from disk or via HTTP.
- The first step is to find out the length of the XML document, allocate a byte array large enough to hold it, and then read the entire document into that array.
- The next step is to create an instance of VTDGen and assign the byte array to it using setDoc().
- The final step is to call parse(boolean ns) to generate the parsed XML representation. When ns is set to true, subsequent document navigation is namespace aware. If parsing succeeds, you can retrieve an instance of VTDNav by calling getNav().
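The first step above can be sketched with the standard library alone. The class name WholeDocumentReader and the method readFully are hypothetical names for this illustration. The VTD-XML calls for the second and third steps appear only as comments, using the method names mentioned in the text, because they need the VTD-XML jar on the classpath and are not compiled here.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads an XML file into one byte array sized from the file length.
final class WholeDocumentReader {

    static byte[] readFully(Path xmlFile) throws IOException {
        long size = Files.size(xmlFile);
        if (size > Integer.MAX_VALUE - 8) {
            throw new IOException("Document too large for a single byte array: " + size);
        }
        byte[] doc = new byte[(int) size];
        try (InputStream in = Files.newInputStream(xmlFile)) {
            int filled = 0;
            // A single read() may return fewer bytes than requested, so loop until full.
            while (filled < doc.length) {
                int n = in.read(doc, filled, doc.length - filled);
                if (n < 0) {
                    throw new EOFException("File ended after " + filled + " bytes");
                }
                filled += n;
            }
        }
        return doc;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java WholeDocumentReader <file.xml>");
            return;
        }
        byte[] doc = readFully(Paths.get(args[0]));
        System.out.println("Read " + doc.length + " bytes");

        // Steps two and three as described in the text (require the VTD-XML library):
        //   VTDGen vg = new VTDGen();
        //   vg.setDoc(doc);
        //   vg.parse(true);          // true = namespace-aware navigation
        //   VTDNav vn = vg.getNav();
    }
}
```

For a document fetched over HTTP, the same loop would apply to the response stream, assuming the server reports a content length.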
At the onset of navigation, the cursor of the VTDNav instance points at the root element of the XML document. To move the cursor manually to a different position in the hierarchy, use one of the overloaded versions of toElement(). The simplest form, toElement(int direction), takes an integer that indicates the direction in which the cursor moves. The six possible values are defined as class variables of VTDNav: ROOT, PARENT, FIRST_CHILD, LAST_CHILD, NEXT_SIBLING, and PREV_SIBLING, with the respective short forms R, P, FC, LC, NS, and PS. toElement() returns a boolean indicating whether the move succeeded. It returns true when the cursor has moved. When the target does not exist (for example, the first child of a childless element), the cursor stays where it is and toElement() returns false.
The method getAttrVal(String attrName) retrieves the attribute value of the element at the cursor position.
Similarly, the getText() method retrieves the text content of the cursor element. In addition, you can use the toElementNS() and getAttrValNS() methods to navigate the document hierarchy in a namespace-aware fashion, provided namespace awareness was turned on during parsing.
AutoPilot is the other mode of navigation. An instance of AutoPilot can automatically move the cursor through the node hierarchy in document order. To use AutoPilot, first call the constructor, which accepts an instance of VTDNav as its input. Then call the selectElement() or selectElementNS() method to specify which descendant elements to visit. Once this is done, each call to the iterate() method moves the cursor to the next matching element.
What Makes VTD-XML Unique
Now let us look at some of the properties that set VTD-XML apart from similar XML APIs, such as DOM and XMLCursor. The hierarchy of VTD-XML consists exclusively of element nodes. This is very different from DOM, which treats every node, whether it is an attribute node or a text node, as part of the hierarchy. In VTD-XML, every instance of VTDNav has exactly one cursor. The cursor can be moved back and forth in the hierarchy, but you cannot duplicate it. However, you can temporarily save the location of the cursor on a global stack. VTDNav has two stack access methods: push() saves the cursor state, and pop() restores it. For example, suppose you are somewhere in the element hierarchy and want to visit a different area of the document, then continue from where you left off. To do this, first push() the current location onto the stack. After moving the cursor to a different part of the document, you can very quickly jump back to the saved location by popping it off the stack.
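The single-cursor model with save and restore can be illustrated with a toy navigator over a small in-memory element tree. This is not how VTD-XML is implemented (it works on VTD records, not node objects), and the class ToyCursor, its nested Element class, and the numeric values of the direction constants are all assumptions for this sketch. Only the method and constant names are borrowed from the text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of a single-cursor navigator; NOT VTD-XML's implementation.
final class ToyCursor {
    // Direction names follow the text; the numeric values are assumed.
    static final int ROOT = 0, PARENT = 1, FIRST_CHILD = 2,
                     LAST_CHILD = 3, NEXT_SIBLING = 4, PREV_SIBLING = 5;

    static final class Element {
        final String name;
        final Element parent;
        final List<Element> children = new ArrayList<>();

        Element(String name, Element parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
    }

    private final Element root;
    private Element current;
    private final Deque<Element> saved = new ArrayDeque<>();

    ToyCursor(Element root) {
        this.root = root;
        this.current = root; // navigation starts at the root element
    }

    // Moves the cursor and returns true, or leaves it in place and returns false.
    boolean toElement(int direction) {
        Element target;
        switch (direction) {
            case ROOT:         target = root; break;
            case PARENT:       target = current.parent; break;
            case FIRST_CHILD:  target = child(0); break;
            case LAST_CHILD:   target = child(current.children.size() - 1); break;
            case NEXT_SIBLING: target = sibling(+1); break;
            case PREV_SIBLING: target = sibling(-1); break;
            default: throw new IllegalArgumentException("unknown direction " + direction);
        }
        if (target == null) return false;
        current = target;
        return true;
    }

    private Element child(int index) {
        List<Element> kids = current.children;
        return (index >= 0 && index < kids.size()) ? kids.get(index) : null;
    }

    private Element sibling(int step) {
        if (current.parent == null) return null;
        List<Element> siblings = current.parent.children;
        int index = siblings.indexOf(current) + step;
        return (index >= 0 && index < siblings.size()) ? siblings.get(index) : null;
    }

    void push() { saved.push(current); }                 // save the cursor location

    boolean pop() {                                      // restore the last saved location
        if (saved.isEmpty()) return false;
        current = saved.pop();
        return true;
    }

    String currentName() { return current.name; }

    public static void main(String[] args) {
        Element library = new Element("library", null);
        Element book = new Element("book", library);
        new Element("title", book);
        new Element("magazine", library);

        ToyCursor cursor = new ToyCursor(library);
        cursor.toElement(FIRST_CHILD);   // library -> book
        cursor.toElement(FIRST_CHILD);   // book -> title
        cursor.push();                   // remember "title"
        cursor.toElement(ROOT);
        cursor.toElement(LAST_CHILD);    // library -> magazine
        System.out.println("visiting: " + cursor.currentName());
        cursor.pop();                    // jump back to the saved location
        System.out.println("restored: " + cursor.currentName());
    }
}
```

The stack keeps the interface small: there is still only one cursor, yet a detour through another part of the document does not lose your place.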
What most clearly distinguishes VTD-XML from other XML processing APIs is its non-extractive tokenization based on the Virtual Token Descriptor. Non-extractive parsing is what allows VTD-XML to achieve its processing and memory efficiency, and it shows up in the API in the following ways. Most of the member methods of VTDNav, such as getAttrVal(), getCurrentIndex(), and getText(), return an integer. This integer is the index of the VTD record that describes the requested token. After parsing, VTD-XML produces a linear buffer filled with VTD records. Because all VTD records have the same length, you can access any record in the buffer if you know its index. The records are not objects, so they cannot be addressed using pointers. When a VTDNav function has no meaningful value to return, it returns -1, which is more or less equivalent to a NULL pointer in DOM.
Because the parsing process does not create any string objects, VTD-XML implements its own set of comparison functions that operate directly on VTD records. For example, the matchElement() method of VTDNav tests whether the element name, which is effectively the VTD record at the cursor, matches a given string. Similarly, the matchTokenString(), matchRawTokenString(), and matchNormalizedTokenString() methods of VTDNav perform a direct comparison between a string and a VTD record. This is advantageous because you do not need to pull tokens out into string objects, which are expensive to create, especially in large numbers. Bypassing excessive object creation is the main reason why VTD-XML significantly outperforms DOM and SAX. VTD-XML also implements its own set of string-to-numeric conversion functions that operate directly on VTD records. VTDNav has four such member methods: parseInt(), parseLong(), parseFloat(), and parseDouble(). These functions take a VTD record index and convert the corresponding token directly into a numeric data type.
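The following sketch shows the general technique of comparing and converting a token in place, given only its offset and length in the document bytes, without first building a String for it. The class TokenOps and both method signatures are hypothetical, the comparison handles ASCII only, and the code is not taken from VTD-XML. The real methods named above take a VTD record index rather than a raw offset and length.

```java
import java.nio.charset.StandardCharsets;

// Illustrative in-place token operations; ASCII-only and not VTD-XML's code.
final class TokenOps {

    // Compares doc[offset .. offset+length) with an ASCII string, allocating nothing.
    static boolean matches(byte[] doc, int offset, int length, String expected) {
        if (expected.length() != length) return false;
        for (int i = 0; i < length; i++) {
            char c = expected.charAt(i);
            if (c > 0x7F || doc[offset + i] != (byte) c) return false;
        }
        return true;
    }

    // Parses a decimal integer straight from the document bytes.
    static int parseInt(byte[] doc, int offset, int length) {
        if (length <= 0) throw new NumberFormatException("empty token");
        int i = offset;
        int end = offset + length;
        boolean negative = doc[i] == '-';
        if (negative || doc[i] == '+') i++;
        if (i == end) throw new NumberFormatException("sign without digits");
        long value = 0;
        for (; i < end; i++) {
            int digit = doc[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at byte " + i);
            }
            value = value * 10 + digit;
            // Stop early so the long accumulator can never overflow.
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("out of int range");
            }
        }
        if (negative) value = -value;
        if (value > Integer.MAX_VALUE) throw new NumberFormatException("out of int range");
        return (int) value;
    }

    public static void main(String[] args) {
        byte[] doc = "<item qty=\"42\">widget</item>".getBytes(StandardCharsets.UTF_8);
        // Element name "item" occupies bytes 1..4; the qty value occupies bytes 11..12.
        System.out.println(matches(doc, 1, 4, "item"));
        System.out.println(parseInt(doc, 11, 2) + 1);
    }
}
```

In a loop over thousands of tokens, skipping the intermediate String per token is where the savings described above come from.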