
Memory Efficient XML Processing not only with DOM

by Peter Karussell · May 01, 2010 · Interview

How can I efficiently parse large XML files, which can be several GB in size? With SAX? Well, yes, you can, but the resulting code is somewhat ugly. If you prefer a more maintainable approach, you should definitely try Joost, which does not load the entire XML file into memory but is quite similar to XSLT.
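To illustrate why plain SAX gets awkward, here is a minimal sketch that collects the `id` attribute of every product element in a file shaped like `<products><product id="..."/></products>`. The class name `SaxProductIds` and the method `collectIds` are hypothetical and not part of the original code. Note that the nesting depth has to be tracked by hand, and anything more involved than reading one attribute would need even more hand-written state.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of the original article.
class SaxProductIds {

    /** Collects the id attribute of every product element directly below the root. */
    static List<String> collectIds(InputSource source) throws Exception {
        final List<String> ids = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            // SAX only reports a flat stream of events, so depth is tracked manually.
            private int depth = 0;

            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                depth++;
                // The default SAXParserFactory is not namespace aware, so qName is used here.
                if (depth == 2 && "product".equals(qName)) {
                    ids.add(attrs.getValue("id"));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                depth--;
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return ids;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<products><product id=\"1\">A</product><product id=\"2\">B</product></products>";
        System.out.println(collectIds(new InputSource(new StringReader(xml))));
    }
}
```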

But how can I do this with DOM, or even better dom4j, if I have only 50 MB of RAM or even less? This is not always possible, but under some circumstances you can do it with a small helper class. Read on!

For example, suppose you have the following XML file:

<products>
<product id="1"> CONTENT1 .. </product>
<product id="2"> CONTENT2 .. </product>
<product id="3"> CONTENT3 .. </product>
...
</products>

Then you can parse it product by product via:

// final so that the anonymous class below can access it
final List<String> idList = new ArrayList<String>();
ContentHandler productHandler =
new GenericXDOMHandler("/products/product") {
@Override
public void writeDocument(String localName, Element element)
throws Exception {
// use DOM here
String id = element.getAttribute("id");
idList.add(id);
}
};
GenericXDOMHandler.execute(new File(inputFile), productHandler);

How does this work? Every time the SAX handler encounters a <product> element, it reads that product's subtree (which is quite small) into RAM as a DOM Element and, once the element is closed, calls the writeDocument method. In effect, we have registered a listener for all product elements and wait for 'events' from our GenericXDOMHandler. Since the handler keeps no reference to a subtree after writeDocument returns, memory use is bounded roughly by the largest single product. The code was developed for my xvantage project but is also used in production code on big files:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

/**
* License: http://en.wikipedia.org/wiki/Public_domain
* This software comes without WARRANTY about anything! Use it at your own risk!
*
* Reads an xml via sax and creates an Element object per document.
*
* @author Peter Karich, peathal 'at' yahoo 'dot' de
*/
public abstract class GenericXDOMHandler extends DefaultHandler {

private Document factory;
private Element current;
private List<String> rootPath;
private int depth = 0;

public GenericXDOMHandler(String forEachDocument) {
rootPath = new ArrayList<String>();
for (String str : forEachDocument.split("/")) {
str = str.trim();
if (str.length() > 0)
rootPath.add(str);
}

if (rootPath.size() < 2)
throw new UnsupportedOperationException("forEachDocument"
+ " must have at least one sub element in it."
+ " E.g. /root/subPath but it was:" + rootPath);
}

@Override
public void startDocument() throws SAXException {
try {
factory = DocumentBuilderFactory.newInstance().
newDocumentBuilder().newDocument();
} catch (Exception e) {
throw new RuntimeException("can't get DOM factory", e);
}
}

@Override
public void startElement(String uri, String local,
String qName, Attributes attrs) throws SAXException {

// go further only if we add something to our sub tree (defined by rootPath)
if (depth + 1 < rootPath.size()) {
current = null;
if (rootPath.get(depth).equals(local))
depth++;

return;
} else if (depth + 1 == rootPath.size()) {
if (!rootPath.get(depth).equals(local))
return;
}

if (current == null) {
// start a new subtree
current = factory.createElement(local);
} else {
Element childElement = factory.createElement(local);
current.appendChild(childElement);
current = childElement;
}

depth++;

// Add every attribute.
for (int i = 0; i < attrs.getLength(); ++i) {
String nsUri = attrs.getURI(i);
String qname = attrs.getQName(i);
String value = attrs.getValue(i);
Attr attr = factory.createAttributeNS(nsUri, qname);
attr.setValue(value);
current.setAttributeNodeNS(attr);
}
}

@Override
public void endElement(String uri, String localName,
String qName) throws SAXException {

if (current == null)
return;

Node parent = current.getParentNode();

// root of the subtree: merge adjacent text nodes before handing it over
if (parent == null)
current.normalize();

if (depth == rootPath.size()) {
try {
writeDocument(localName, current);
} catch (Exception ex) {
throw new RuntimeException("Exception"
+ " while writing one element of path:" + rootPath, ex);
}
}

// climb up one level
current = (Element) parent;
depth--;
}

@Override
public void characters(char buf[], int offset, int length)
throws SAXException {
if (current != null)
current.appendChild(factory.createTextNode(
new String(buf, offset, length)));
}

// called once for every completed subtree that matches rootPath
public abstract void writeDocument(String localName, Element element)
throws Exception;

public static void execute(File inputFile,
ContentHandler handler)
throws SAXException, FileNotFoundException, IOException {

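InputStream input = new FileInputStream(inputFile);
try {
execute(input, handler);
} finally {
// close the file even if parsing fails
input.close();
}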
}

public static void execute(InputStream input,
ContentHandler handler)
throws SAXException, FileNotFoundException, IOException {

XMLReader xr = XMLReaderFactory.createXMLReader();
xr.setContentHandler(handler);
InputSource iSource = new InputSource(new InputStreamReader(input, "UTF-8"));
xr.parse(iSource);
}
}

PS: It should be simple to adapt this class to your needs, e.g. to use dom4j instead of DOM. You could even register several paths instead of a single rootPath via a BindingTree. For an implementation of this, look at my xvantage project.
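As a rough idea of what registering several paths could look like, here is a minimal sketch. It is not the BindingTree from xvantage; the class `MultiPathDomHandler` and its `register` method are hypothetical names, and the sketch assumes Java 8 or later for `java.util.function.Consumer`. It keeps a stack of open element names and starts building a DOM subtree whenever the current absolute path matches a registered one. A registered path nested inside another registered path is simply built as part of the outer subtree.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; names and structure are assumptions, not the xvantage implementation.
class MultiPathDomHandler extends DefaultHandler {
    private final Map<String, Consumer<Element>> listeners = new HashMap<>();
    private final Deque<String> path = new ArrayDeque<>(); // names of all open elements
    private final Document factory;
    private Element current;          // node being built, null outside a registered subtree
    private Consumer<Element> active; // listener of the subtree being built

    MultiPathDomHandler() throws Exception {
        factory = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    /** Registers a listener for an absolute path such as "/products/product". */
    void register(String absolutePath, Consumer<Element> listener) {
        listeners.put(absolutePath, listener);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        path.addLast(qName);
        if (current == null) {
            // Outside any subtree: start one only if the current path is registered.
            active = listeners.get("/" + String.join("/", path));
            if (active == null) return;
            current = factory.createElement(qName);
        } else {
            Element child = factory.createElement(qName);
            current.appendChild(child);
            current = child;
        }
        for (int i = 0; i < attrs.getLength(); i++) {
            current.setAttribute(attrs.getQName(i), attrs.getValue(i));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        path.removeLast();
        if (current == null) return;
        Element parent = (Element) current.getParentNode();
        // A null parent means the subtree root was just closed: hand it to its listener.
        if (parent == null) active.accept(current);
        current = parent;
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        if (current != null) {
            current.appendChild(factory.createTextNode(new String(buf, offset, length)));
        }
    }

    public static void main(String[] args) throws Exception {
        MultiPathDomHandler handler = new MultiPathDomHandler();
        handler.register("/shop/products/product", e -> System.out.println("product " + e.getAttribute("id")));
        handler.register("/shop/vendors/vendor", e -> System.out.println("vendor " + e.getTextContent()));
        String xml = "<shop><products><product id=\"1\"/></products>"
                + "<vendors><vendor>acme</vendor></vendors></shop>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}
```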

PPS: If you want to evaluate XPath expressions in the writeDocument method, make sure that the ordinary XPath engine does not become a performance bottleneck, because the method is called once per document. In my case I had several thousand documents, and switching to jaxen solved this problem!
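If you stay with the JDK's own javax.xml.xpath API instead of jaxen, one common way to reduce the per-call cost is to compile the expression once and reuse it for every document. The following is only a sketch of that idea, not a measurement and not what the original code does. The class name `CompiledXPathExample` and the `<name>` child element are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class; the <name> child element is an assumed document layout.
class CompiledXPathExample {
    // Compiled once; XPathExpression is not thread-safe, so use one instance per thread.
    private final XPathExpression nameExpr;

    CompiledXPathExample() throws XPathExpressionException {
        nameExpr = XPathFactory.newInstance().newXPath().compile("name/text()");
    }

    /** Evaluates the precompiled expression relative to the given product element. */
    String nameOf(Element product) throws XPathExpressionException {
        return nameExpr.evaluate(product);
    }

    public static void main(String[] args) throws Exception {
        Element product = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                        "<product id=\"1\"><name>Widget</name></product>")))
                .getDocumentElement();
        CompiledXPathExample example = new CompiledXPathExample();
        System.out.println(example.nameOf(product));
    }
}
```

Inside a writeDocument implementation, the same precompiled expression would be evaluated against each element that the handler passes in.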

PPPS: If you want to handle XML writing and reading ('XML serialization') from Java classes, check this list out!

 

From http://karussell.wordpress.com/2010/04/29/memory-efficient-xml-processing-not-only-with-dom/
