Extracting Data From Very Large XML Files With X-definition

X-definition is an open-source Java API that can be used to extract data from XML files regardless of their size. In this tutorial, see X-definition in action.

Curt Selak

Jan. 03, 22 · Tutorial

X-definition is an open-source Java API that can be used to extract data from XML files regardless of their size. It will not compel the Java Virtual Machine to complain that it is out of heap memory, nor does it even require that your Java code step through the parts of your XML in the order of their occurrence until the location of the data you need is reached. It requires little more than a markup model of your XML document, and about 90 to 120 seconds of processing time for each gigabyte of XML data.

In this article, we'll download a modest (2.5 GB) file from data.discogs.com and extract data from it using a minimum of code. Our X-definition instructions will amount to the following:

    XML
   
 

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
  <masters>
    <master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther; occurs *; finally outln(); forget;'>
      <main_release>
        onTrue out(getText());
      </main_release>
      <artists>
        <artist xd:script="options ignoreOther; occurs *;">
          <id>
            onTrue out("\t" + getText());
          </id>
          <name>
            onTrue out("\t" + getText());
          </name>
        </artist>
      </artists>
    </master>
  </masters>
</xd:def>
  

In order to call the X-definition API, all we'll need is this code:

    Java
   
 

   import org.xdef.sys.NullReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;

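public class Xdefinition {

    public static void main(String[] argv) throws Exception {
        // Constructing an instance runs the initializer block below.
        Xdefinition xdefinition = new Xdefinition();
    }

    // Instance initializer block: executed whenever an Xdefinition is constructed.
    {
        // Compile the .xdef file into a pool of X-definitions.
        XDPool xpool = XDFactory.compileXD(null, "markup.xdef");
        // Obtain the object that will process the XML document.
        XDDocument xdoc = xpool.createXDDocument();
        // Parse the XML document; error reports go to a NullReportWriter.
        xdoc.xparse("discogs_20211201_masters.xml", new NullReportWriter(false));
    }
}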
  

There will be nothing else to write, compile, or run.
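For contrast, here is a rough sketch of what a comparable extraction might look like if written by hand against the JDK's streaming StAX API (javax.xml.stream). It is not part of X-definition: the class name StaxMasters and the inline sample document are hypothetical, and the code only illustrates the event-by-event stepping that X-definition spares us.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical comparison class: plain StAX, no X-definition involved.
public class StaxMasters {

    // Returns one tab-separated row per master element.
    static String extract(Reader xml) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // names of the currently open elements
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    String parent = open.peek();
                    open.push(name);
                    if (name.equals("master")) {
                        out.append(r.getAttributeValue(null, "id")).append('\t');
                    } else if (name.equals("main_release") && "master".equals(parent)) {
                        out.append(r.getElementText());
                        open.pop(); // getElementText() has consumed the end tag
                    } else if ("artist".equals(parent)
                            && (name.equals("id") || name.equals("name"))) {
                        out.append('\t').append(r.getElementText());
                        open.pop();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (open.pop().equals("master")) {
                        out.append('\n'); // end of the table row
                    }
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<masters><master id=\"2407459\">"
                + "<main_release>6201234</main_release>"
                + "<artists><artist><id>4054418</id><name>Lee Caron</name></artist></artists>"
                + "<year>1955</year></master></masters>";
        System.out.print(extract(new StringReader(sample)));
    }
}
```

Even in this small form, the code has to track which elements are open and react to each parser event in document order, which is the bookkeeping the .xdef file expresses declaratively.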

Downloading the XML Document

From the download site, I used a file from the 2021 directory: the archive is named discogs_20211201_masters.xml.gz. Once decompressed (with gunzip, for instance), it yields discogs_20211201_masters.xml, the file name used in the code throughout this article.

It's the second-largest file for the month in question (2.5 GB, as I mentioned earlier). The largest (discogs_20211201_releases.xml.gz) exceeds 60 GB when decompressed.
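To put those sizes in perspective, the rough figure quoted at the outset (90 to 120 seconds per gigabyte) can be turned into a back-of-the-envelope estimate. The sketch below is only arithmetic: the class name ParseTimeEstimate is hypothetical, it assumes the sizes refer to decompressed XML, and actual timings will depend on hardware.

```java
// Hypothetical helper: arithmetic only, not part of X-definition.
public class ParseTimeEstimate {

    // Rough bounds taken from the article's estimate of 90 to 120 seconds per gigabyte.
    static final int MIN_SECONDS_PER_GB = 90;
    static final int MAX_SECONDS_PER_GB = 120;

    // Returns {lower bound, upper bound} in seconds for a file of the given size.
    static long[] secondsRange(double gigabytes) {
        return new long[] {
            Math.round(gigabytes * MIN_SECONDS_PER_GB),
            Math.round(gigabytes * MAX_SECONDS_PER_GB)
        };
    }

    public static void main(String[] args) {
        long[] masters = secondsRange(2.5);   // the masters file
        long[] releases = secondsRange(60.0); // the releases file, at the very least
        System.out.println("masters:  " + masters[0] + " to " + masters[1] + " s");
        System.out.println("releases: " + releases[0] / 60 + " to " + releases[1] / 60 + " min");
    }
}
```

On these assumptions, the masters file should take roughly four to five minutes, and the releases file at least an hour and a half to two hours.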

Downloading X-definition

You can download X-definition here. The only file that you'll need in your classpath is xdef-41.0.2.jar, located in the archive's xdef directory.

The archive also contains API documentation, source code (including the code for both the API's interfaces and Syntea's implementations of same; the JavaDoc furnished is currently only for the former), and various user manuals in PDF format. The user manual to which I'll refer throughout this article is xdef-4.1.pdf (Language Description), which can also be found on GitHub (and will be linked to throughout when opportunity knocks).

Preparing the .xdef File

We saw the .xdef file in its entirety a moment ago: it bore an XML declaration and consisted of markup written in XML. In the manual, either it or an xd:def element is frequently referred to as an "X-definition".

What that more or less indicates is that it is a [perhaps abbreviated] model of an XML document. "Model" is a term the manual uses at least once. The general idea is to offer a manner of representing an existing XML document that parallels the human-readable document to hand a little more obviously than an XML schema definition might.

Indeed, as we begin writing our .xdef file, we can proceed quite as we would if we were making an empty copy of the document we downloaded. It's not readily done when the document is far too large to open in a text editor, but it's not impossible. On a Linux system, we can open the XML file with the less command and discover the document element, which is called masters. Once we have done this, we have enough information to write our .xdef file's enclosing element:

    XML
   
   <?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
</xd:def>

We're not required to do more than identify the document root by means of the root attribute (and also declare the X-definition namespace in order to use the prefix xd throughout). There is an optional name attribute that we can use to identify the xd:def element itself, but we would only need it if we were incorporating more than one in a project. All we'll need when we call the API later will be the path to our .xdef file.

Appendix A of the manual contains a description in the standard (Backus-Naur) notation of the xd:def element and the other elements and attributes that are permitted in an .xdef file.
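Where the less command isn't available, the document element can also be discovered programmatically without reading the whole file. The following is a minimal sketch using the JDK's StAX API rather than X-definition; the class name RootElementName is hypothetical, and a short inline string stands in for the real document.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: plain StAX, no X-definition involved.
public class RootElementName {

    // Reads only as far as the first start tag and returns its name.
    static String rootName(Reader xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                while (r.hasNext()) {
                    if (r.next() == XMLStreamConstants.START_ELEMENT) {
                        return r.getLocalName();
                    }
                }
                throw new IllegalStateException("no element found");
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<masters><master id=\"2407459\"/></masters>";
        System.out.println(rootName(new StringReader(sample)));
    }
}
```

For the real file, the Reader could come from java.nio.file.Files.newBufferedReader(); because the parser pulls input on demand, only the beginning of the document needs to be read.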

When we glance at our XML document, it becomes apparent that the masters element encloses master elements. The following is to some extent representative of a master element:

    XML
   
 

   
<master id="2407459">
<main_release>6201234</main_release>
<images>
<image type="primary" uri="" uri150="" width="600" height="600"/>
<image type="secondary" uri="" uri150="" width="600" height="597"/>
</images>
<artists>
<artist>
<id>4054418</id>
<name>Lee Caron</name>
<anv></anv>
<join></join>
<role></role>
<tracks></tracks>
</artist>
</artists>
<genres>
<genre>Rock</genre>
<genre>Pop</genre>
</genres>
<styles>
<style>Rhythm &amp; Blues</style>
<style>Rock &amp; Roll</style>
</styles>
<year>1955</year>
<title>Back To An Empty Room</title>
<data_quality>Correct</data_quality>
<videos>
<video src="https://www.youtube.com/watch?v=2h-xb5bUhTE" duration="185" embed="true">
<title>Back To An Empty Room by Lee Caron</title>
<description>Please leave comment.</description>
</video>
</videos>
</master>
  

It isn't fully representative, however: if we look further into the document, we'll find that an artists element sometimes encloses more than one artist element, and that not every master element contains a videos element. As long as we keep in mind that there can be consecutive artist elements, though, what we know thus far about our XML document is sufficient.

We'll focus on our master, main_release, and artists elements. Our objective will amount to nothing more than a simple text table. Each row will contain the master element's id attribute, the main_release element's content, and the content of each id and name element that is a child of an artist element.
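To make the target layout concrete, here is a small sketch of how one such row is assembled from the sample master element shown above. It is plain Java for illustration only; the class name MasterRowFormat and its method are hypothetical and have nothing to do with the X-definition API.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the row layout; not part of X-definition.
public class MasterRowFormat {

    // Builds one row: master id, main_release, then an id/name pair per artist.
    static String row(String masterId, String mainRelease, List<String[]> artists) {
        StringBuilder sb = new StringBuilder();
        sb.append(masterId).append('\t').append(mainRelease);
        for (String[] artist : artists) {
            sb.append('\t').append(artist[0]).append('\t').append(artist[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values taken from the sample master element.
        List<String[]> artists =
                Collections.singletonList(new String[] {"4054418", "Lee Caron"});
        String row = row("2407459", "6201234", artists);
        // Show the tabs as visible separators.
        System.out.println(row.replace("\t", " | "));
    }
}
```

A master element with several artists simply yields a longer row, with one further id/name pair per artist.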

It's not a challenge as long as our .xdef file can identify the elements or attributes in which we're interested. The first of these is the element our xd:def element's root attribute designated earlier on:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
</masters>
</xd:def>

We've incorporated masters in our .xdef file in such a way that it will enclose the other elements, as it does in our XML document. What we've typed is less simple than it appears, however, because it is equivalent to:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters xd:script="required">
</masters>
</xd:def>

We'll use the auxiliary xd:script attribute elsewhere. Where it is absent, or where it doesn't contain what we will soon learn is a quantifier (or, as above, a keyword equivalent to one: required is equivalent to occurs 1), X-definition defaults to required, which translates to "occurs once and only once".

Because the master element not only contains an attribute (id) that we will need, but also encloses other elements the respective content of which we will need, it will demand more of our attention than the other elements shall. Uniquely for our .xdef file, it will have one attribute corresponding to an attribute in the XML document, as well as one auxiliary xd:script attribute:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='' xd:script=''>
</master>
</masters>
</xd:def>

The values will be single-quoted in order to accommodate double quotes and will consist of what X-definition's developers call X-script.

The first of the two attributes to begin thinking about is xd:script. Please bear in mind: once both have been filled in, much of what we really need to do where the .xdef file is concerned will be finished.

I noted earlier that the master element encloses other elements (main_release, artists) that we will need. Not all shall be needed, however, and if we include an options section in our xd:script attribute's value, we can tell X-definition to take account of only those we mention in our .xdef file:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='' xd:script='options ignoreOther;'>
</master>
</masters>
</xd:def>

We'll need to do that any time that the element we are on contains child elements we wish to ignore. Otherwise, X-definition will generate error output (in our case invisibly, as we'll determine when calling the API in our Java code) whenever elements we ignore in our .xdef file are detected in our XML document. The options section is discussed in section 4.1.22 of the manual.

What we need to consider next is the number of times the master element potentially occurs. Where masters was concerned, we declined to do so because X-definition's default (the required keyword, equivalent to occurs 1) was already acceptable. If we do the same here, just one master element will be processed (and an error message will be generated for each of the rest). We can indicate that master occurs zero or more times:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='' xd:script='options ignoreOther;occurs *;'>
</master>
</masters>
</xd:def>

A quantifier like the above (occurs *) will be needed every time that we know an element (e.g., artist) will occur more than once. Section 4.1.10 of the manual discusses quantifiers.

The last thing that we need to consider (for the moment) about our master element's xd:script attribute is to do with the very large number of master elements — more than 1.5 million — at our disposal. In order for each master element to be removed from memory once X-definition is ready to move along to the following master element, the forget keyword needs to be deployed at the very tail of the xd:script attribute's value:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='' xd:script='options ignoreOther;occurs *;forget;'>
</master>
</masters>
</xd:def>

Were our XML document to have had any other elements on the same level in the tree as master, and were they to have been referred to as well in our .xdef file, the forget keyword would have needed to be used in their context as well as in that of master. Otherwise, the heap memory available to the JVM could ultimately be exhausted. The forget keyword is discussed in section 4.1.9 of the manual.

Our X-script thus far has been assigned to a part of a document (an element) by means of an auxiliary attribute (xd:script). We can do similarly for an attribute or a text node known to exist in our XML document by simply entering the X-script in our .xdef file's corresponding location: for our master element's id attribute, that location would be within the single quotation marks. The id attribute's value as found in our XML document is the source of the initial cell of each row of our text table:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;forget;'>
</master>
</masters>
</xd:def>

The onTrue keyword refers to an event. When X-definition alights on a master element, its id attribute becomes available. In the present case, the onTrue event occurs if the id attribute bears a string of any length as its value. Were it indicated elsewhere in the X-script that the value of id need only ever be an integer, and were X-definition unable to interpret that value as such, the onFalse event would occur instead. Events in X-script have corresponding actions. Here, the X-script defines the action as writing the id attribute's value, followed by a tab character, to standard output. Events are discussed in sections 2.10 and 4.1.9 of the manual; the getText() and out() functions are in section 4.1.19. Section 7.1 (6.1 in the version on GitHub) discusses the order in which X-definition processes the parts of an XML document.
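The pairing of events with actions can be pictured with a loose analogy in plain Java. The sketch below is conceptual only and says nothing about how X-definition is implemented: the class EventAnalogy, the integer check, and the placeholder written by the "onFalse" branch are all hypothetical.

```java
import java.util.function.Predicate;

// Conceptual analogy only; this is not how X-definition is implemented.
public class EventAnalogy {

    // Runs the "onTrue" action if the value passes the check, the "onFalse" action otherwise.
    static String handle(String value, Predicate<String> check) {
        if (check.test(value)) {
            return value + "\t"; // mirrors: onTrue out(getText() + "\t");
        }
        return "?\t";            // hypothetical onFalse action: emit a placeholder cell
    }

    public static void main(String[] args) {
        Predicate<String> anyText = v -> true;                  // no restriction, as in our .xdef file
        Predicate<String> integerOnly = v -> v.matches("\\d+"); // stricter, hypothetical check
        System.out.print(handle("2407459", anyText));
        System.out.print(handle("24o7459", integerOnly));
        System.out.println();
    }
}
```

In our .xdef file no such restriction is placed on id, so only the onTrue branch ever matters.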

Before leaving the master element, X-definition will extract the rest of the data for our table row from its child elements. We can terminate our table row once it has finished with the main_release and artists elements (which is to say, once the finally event has occurred in the master element's context) by revisiting our xd:script auxiliary attribute and adding code to it:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
</master>
</masters>
</xd:def>

A line break, written by outln(), will terminate each row.

Partly because our text table presents its data in the order in which it occurs in the XML document — we therefore don't need to define variables for storing any — remarkably little code is left to write. We won't need to refer to any more attributes in the XML document, so adding our remaining elements to what we have got so far is really easy:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
<main_release>
</main_release>
<artists>
<artist>
<id>
</id>
<name>
</name>
</artist>
</artists>
</master>
</masters>
</xd:def>

The text content of the main_release, id, and name elements will furnish the rest of the data for our table row.

The X-script we'll add to main_release is identical to that used earlier for the master element's id attribute, except for the tab character (which at this point would prove redundant):

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
<main_release>
onTrue out(getText());
</main_release>
<artists>
<artist>
<id>
</id>
<name>
</name>
</artist>
</artists>
</master>
</masters>
</xd:def>

Earlier, we placed the X-script where the master element's id attribute value would have appeared in the XML document; above, we've placed it where the main_release element's text content would go. The getText() function will obtain the main_release element's text content, and the out() function will send it to standard output.

The way we deal with the artist element parallels what we did earlier in the master element's xd:script attribute:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
<main_release>
onTrue out(getText());
</main_release>
<artists>
<artist xd:script="options ignoreOther;occurs *;">
<id>
</id>
<name>
</name>
</artist>
</artists>
</master>
</masters>
</xd:def>

As before, leaving out the ignoreOther option will assure copious error output, because we don't plan on taking account of every last one of the artist element's children, and omitting the quantifier (occurs *) will cause still more error output and also result in only the first artist element being visited.

Finishing up is as easy as adding X-script to the id and name elements where the text content would occur in the original XML document:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
  <masters>
    <master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther; occurs *; finally outln(); forget;'>
      <main_release>
        onTrue out(getText());
      </main_release>
      <artists>
        <artist xd:script="options ignoreOther; occurs *;">
          <id>
            onTrue out("\t" + getText());
          </id>
          <name>
            onTrue out("\t" + getText());
          </name>
        </artist>
      </artists>
    </master>
  </masters>
</xd:def>

When X-definition detects text in either element, the element's content (preceded by a tab character) is written to standard output. As noted earlier, the finally clause in the master element's xd:script attribute will then terminate the table row.
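Because each row has a predictable shape, it is straightforward to read back. The following is a minimal sketch of a consumer for one row, written in plain Java; the class MasterRowParser is hypothetical, and it assumes that every artist contributes both an id and a name and that no cell contains a tab character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical consumer of the text table; not part of X-definition.
public class MasterRowParser {

    static final class Master {
        final String id;
        final String mainRelease;
        final List<String[]> artists = new ArrayList<>(); // each entry is {artist id, artist name}

        Master(String id, String mainRelease) {
            this.id = id;
            this.mainRelease = mainRelease;
        }
    }

    // Splits one row into the master id, the main release, and the artist pairs.
    static Master parse(String row) {
        String[] cells = row.split("\t", -1); // -1 keeps trailing empty cells
        if (cells.length < 2 || cells.length % 2 != 0) {
            throw new IllegalArgumentException("unexpected cell count: " + cells.length);
        }
        Master master = new Master(cells[0], cells[1]);
        for (int i = 2; i < cells.length; i += 2) {
            master.artists.add(new String[] {cells[i], cells[i + 1]});
        }
        return master;
    }

    public static void main(String[] args) {
        Master m = parse("2407459\t6201234\t4054418\tLee Caron");
        System.out.println(m.id + " has " + m.artists.size() + " artist(s); first: "
                + m.artists.get(0)[1]);
    }
}
```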

With our .xdef file complete (just as we previewed it at the start of this article), we're ready to run X-definition!

Running X-definition

Instructions for calling the X-definition API are contained in sections 9 and 9.1 of the manual (8.1 in the version on GitHub). Much of the work that will go into our Java source file is to do with importing the handful of interfaces and classes that our code will require:

    Java
   
   import org.xdef.sys.NullReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;

The main method really needn't do more than instantiate our driver class, which I've named Xdefinition:

import org.xdef.sys.NullReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;

public class Xdefinition {
        
    public static void main(String[] argv) throws Exception
        {
            Xdefinition xdefinition = new Xdefinition();
        }

As previewed at the beginning, the initializer block consists of just three lines of code.

Our first line needs to compile our .xdef file, which I named "markup.xdef", into an XDPool:

    Java
   
   XDPool xpool = XDFactory.compileXD(null,"markup.xdef");

XDPool is defined in the API as an interface: here, you obtain an instance by passing the name of the .xdef file to the XDFactory class's compileXD() method. Were the first argument to compileXD() not null, it would have supplied X-definition with configuration properties (details are in the API documentation for org.xdef.XDFactory).

The next line of the initializer block is simple:

    Java
   
   XDDocument xdoc = xpool.createXDDocument();

We just call our XDPool instance's createXDDocument() method in order to obtain an XDDocument instance. XDDocument is defined in the API as an interface. The API JavaDoc implies that instances thereof do the actual processing of our XML document.

Indeed, the final line is what triggers the processing:

    Java
   
   xdoc.xparse("discogs_20211201_masters.xml",new NullReportWriter(false));

The XML document (discogs_20211201_masters.xml) has been presumed to be located in the same directory as the .xdef file passed earlier to compileXD(). Our XDDocument instance's xparse() method effectively starts X-definition. X-definition more or less requires that any error output go somewhere: accordingly, an instance of a dedicated class defined in the API, NullReportWriter, is passed as xparse()'s second argument, after the location of the source XML. The API documentation for org.xdef.sys.NullReportWriter suggests the potential consequences of passing true rather than false as a parameter when making a new instance.

The code we will compile, including the initializer block, looks like this:

import org.xdef.sys.NullReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;

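public class Xdefinition {

    public static void main(String[] argv) throws Exception {
        // Constructing an instance runs the initializer block below.
        Xdefinition xdefinition = new Xdefinition();
    }

    // Instance initializer block: executed whenever an Xdefinition is constructed.
    {
        // Compile the .xdef file into a pool of X-definitions.
        XDPool xpool = XDFactory.compileXD(null, "markup.xdef");
        // Obtain the object that will process the XML document.
        XDDocument xdoc = xpool.createXDDocument();
        // Parse the XML document; error reports go to a NullReportWriter.
        xdoc.xparse("discogs_20211201_masters.xml", new NullReportWriter(false));
    }
}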

In order for it to compile by means of the javac command, we need to have the xdef-41.0.2.jar file on our classpath. We wired the locations of our .xdef file and our large XML document into our code, so all we need in order to finally run X-definition is a command like the following (on Windows, the classpath separator is ; rather than :):

    Plain Text
   
   java -cp .:xdef-41.0.2.jar Xdefinition

When X-definition runs, the text table shall spill out onto the terminal unless your command's output was redirected. What you won't see in any case are any error messages, because the NullReportWriter class has the distinction of making the error output disappear. You don't even learn whether or not there was any.
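If the output was redirected to a file, one quick sanity check is to count its rows: since each master element ends its row with outln(), the count should match the number of master elements in the XML document. The sketch below is plain Java, with a hypothetical class name RowCounter and a short inline string standing in for the redirected file.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sanity check on the redirected table; not part of X-definition.
public class RowCounter {

    // Counts non-empty lines (table rows) one at a time, without loading the whole input.
    static long countRows(Reader table) {
        BufferedReader reader = new BufferedReader(table);
        return reader.lines().filter(line -> !line.isEmpty()).count();
    }

    public static void main(String[] args) {
        String sample = "2407459\t6201234\t4054418\tLee Caron\n"
                + "1\t2\t3\tA\t4\tB\n";
        System.out.println(countRows(new StringReader(sample)) + " rows");
    }
}
```

For the real table, the Reader could again come from java.nio.file.Files.newBufferedReader().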

An alternative to passing a NullReportWriter instance to xparse() is to pass an instance of org.xdef.sys.FileReportWriter. Had we chosen to do so, our code could then have been as follows:

import org.xdef.sys.FileReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;

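public class Xdefinition {

    public static void main(String[] argv) throws Exception {
        // Constructing an instance runs the initializer block below.
        Xdefinition xdefinition = new Xdefinition();
    }

    // Instance initializer block: executed whenever an Xdefinition is constructed.
    {
        // Compile the .xdef file into a pool of X-definitions.
        XDPool xpool = XDFactory.compileXD(null, "markup.xdef");
        // Obtain the object that will process the XML document.
        XDDocument xdoc = xpool.createXDDocument();
        // Parse the XML document; error reports are written to /dev/stderr.
        xdoc.xparse("discogs_20211201_masters.xml", new FileReportWriter("/dev/stderr"));
    }
}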

The outcome will be identical if the above code is compiled and run, simply because I made sure earlier that our .xdef file wouldn't cause X-definition to detect any errors. If you use the FileReportWriter class and really do have errors, you'll likely wind up with a terminal (or file) full of verbose output. Above, I passed the location of my system's standard error stream to the FileReportWriter constructor rather than store potential error messages on disk. Your code will still run regardless, and you needn't necessarily deal with the errors provided the output is all that you anticipated. Strictly avoiding errors as I've done here is, I'm afraid, less a question of convenience than of using X-definition consistently with its overall design.

You see, X-definition is designed principally for checking whether a given XML document is valid while you process it. We didn't purposefully attempt it here, but you can, if you wish, set the criteria much as you could were you using a schema-aware XSLT processor instead. Quite as iText's developer built on his knowledge of the PDF specification, the developers of X-definition familiarized themselves with XSD and devised an alternative means of realizing its possibilities.
