
Conveniently Processing Large XML Files with Java

Processing a large XML file with a SAX parser requires only constant, low memory. To make this convenient, we need to take a closer look at our XML input data.


When processing XML data, it's usually most convenient to load the whole document using a DOM parser and fire some XPath queries against the result. However, since we're building a multi-tenant eCommerce platform, we regularly have to handle large XML files with sizes above 1 GB. You certainly don't want to load such a beast into the heap of a production server, since its DOM representation easily grows to 3 GB or more.

So what to do? Well, SAX to the rescue! Processing a large XML file with a SAX parser requires only constant (low) memory, since it merely invokes callbacks for the XML tokens it detects. On the other hand, parsing complex XML this way quickly becomes a mess.
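For contrast, here is what plain SAX looks like with the JDK's built-in parser (a minimal sketch, unrelated to the library shown later): even to extract a single field you have to track parser state by hand across three callbacks, and this bookkeeping multiplies with every nested element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PlainSaxDemo {

    /** Collects the text of every <name> element, SAX-style. */
    static List<String> names(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // manual state: non-null while inside <name>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("name".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("name".equals(qName)) {
                    out.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(names(
                "<nodes><node><name>Node 1</name><price>100</price></node></nodes>"));
    }
}
```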

To resolve this problem, we need to take a closer look at our XML input data. Most of the time, at least in our cases, you don't need the whole DOM at once. If you're importing product information, it's sufficient to look at one product at a time. Example:

<nodes>
    <node>
        <name>Node 1</name>
        <price>100</price>
    </node>
    <node>
        <name>Node 2</name>
        <price>23</price>
    </node>
    <node>
        <name>Node 3</name>
        <price>12.4</price>
        <resources>
            <resource type="test">Hello 1</resource>
            <resource type="test1">Hello 2</resource>
        </resources>
    </node>
</nodes>

When processing Node 1, we don't need access to any attribute of Node 2 or Node 3; likewise, when processing Node 2, we don't need Node 1 or Node 3, and so on. So what we want is a partial DOM, in our example one for every <node>.


What we've therefore built is a SAX parser for which you can specify which XML elements you are interested in. Once such an element starts, we record the whole subtree. When it completes, we notify a handler, which can then run XPath expressions against this partial DOM. After that, the DOM is released and the SAX parser continues.
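The same partial-DOM idea can also be sketched with nothing but the JDK, using StAX plus an identity transform: position a streaming reader on each interesting element, materialize just that subtree as a DOM fragment, query it, and let it be garbage-collected. This is a minimal sketch of the technique, not the library's actual implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

public class PartialDomSketch {

    /** Extracts the <name> of every <node> while keeping only one subtree in memory. */
    static List<String> readNames(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> names = new ArrayList<>();
        while (true) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && "node".equals(reader.getLocalName())) {
                // Materialize only this <node> subtree as a small DOM fragment.
                DOMResult fragment = new DOMResult();
                identity.transform(new StAXSource(reader), fragment);
                names.add(xpath.evaluate("node/name", fragment.getNode()));
                // The fragment goes out of scope here and becomes garbage-collectable.
            } else if (reader.hasNext()) {
                reader.next();
            } else {
                break;
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readNames(
                "<nodes><node><name>Node 1</name><price>100</price></node>"
              + "<node><name>Node 2</name><price>23</price></node></nodes>"));
    }
}
```

Note the loop checks the current event before advancing, since implementations may leave the reader either on or just past the fragment's closing tag after the transform.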


Here is a shortened example of how you could parse the XML above, one <node> at a time:

XMLReader r = new XMLReader();
r.addHandler("node", new NodeHandler() {

    @Override
    public void process(StructuredNode node) {
        System.out.println(node.queryString("name"));
        System.out.println(node.queryValue("price").asDouble(0d));
    }
});

r.parse(new FileInputStream("src/examples/test.xml"));


The full example, along with the implementation, is open source (MIT license) and available here:

https://github.com/andyHa/scireumOpen/tree/master/src/com/scireum/open/xml

https://github.com/andyHa/scireumOpen/blob/master/src/examples/ExampleXML.java


We successfully handle up to five parallel imports of 1 GB+ XML files in our production system, without measurable heap growth. (Instead of using a FileInputStream, we use Java's ZIP capabilities and directly open and process ZIP versions of the XML files. This shrinks those monsters down to 20-50 MB and makes uploads etc. much easier.)
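The ZIP trick can be sketched with the JDK's java.util.zip: a ZipInputStream decompresses on the fly, so you can hand it straight to the parser without ever expanding the archive. The entry name and the in-memory demo archive below are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZippedXmlInput {

    /** Builds a small in-memory ZIP archive with a single entry (demo helper). */
    static byte[] zip(String entryName, String content) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(buf)) {
            zout.putNextEntry(new ZipEntry(entryName));
            zout.write(content.getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        return buf.toByteArray();
    }

    /** Opens the first ZIP entry and returns its decompressed content. */
    static String firstEntryAsString(InputStream raw) throws Exception {
        ZipInputStream zin = new ZipInputStream(raw);
        ZipEntry entry = zin.getNextEntry(); // positions the stream at the entry's data
        if (entry == null) {
            throw new IllegalStateException("empty archive");
        }
        // In a real import you would hand 'zin' to the XML parser instead,
        // e.g. r.parse(zin) -- the archive is decompressed on the fly, never in full.
        return new String(zin.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] archive = zip("test.xml", "<nodes/>");
        System.out.println(firstEntryAsString(new ByteArrayInputStream(archive)));
    }
}
```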


