XMLReader vs XmlExtractKit for Real XML Extraction Tasks in PHP

Compare raw XMLReader with XmlExtractKit on a real extraction task: complex and repeated XML records in, plain PHP arrays out.

Nicholas Volkhin

Jun. 08, 26 · Analysis


When PHP developers compare XML approaches, the comparison often starts in the wrong place.

It usually becomes a vague question like this: "What is the best XML library for PHP?" That is too broad to be useful.

In real projects, the question is usually much narrower:

I have a large XML file
It contains repeated business records
I only need some of those records
I want application-friendly PHP data, not a full in-memory XML tree

That is not a general XML problem. It is an extraction task.

And for this kind of work, the most honest comparison is often not between two third-party packages. It is between:

Raw XMLReader, where you write the extraction logic yourself;
A focused extraction toolkit, where the streaming model stays the same, but the glue code becomes reusable.

In my case, that focused toolkit is XmlExtractKit, published as sbwerewolf/xml-navigator. This article compares both approaches on the same practical task.

The Task

Suppose we have a large XML feed that contains repeated `<offer>` records, mixed with other node types that we do not care about.

We want to extract each offer into a PHP array with a shape like this:

    PHP

[
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.90,
    'currency' => 'USD',
]
  

Here is the sample XML:

    XML

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <offer id="1001" available="true">
    <name>Keyboard</name>
    <price currency="USD">49.90</price>
  </offer>
  <service id="s-1">
    <name>Warranty</name>
  </service>
  <offer id="1002" available="false">
    <name>Mouse</name>
    <price currency="USD">19.90</price>
  </offer>
</catalog>
  

This is a simple example, but it is representative of a lot of real XML integration work: repeated nodes, some attributes, some nested values, and other elements that should be ignored.

Option 1: Raw XMLReader

The low-level memory-safe baseline in PHP is XMLReader. That makes it the right foundation for large-file extraction.

Here is one way to solve the task with plain `XMLReader` and a small amount of helper parsing:

    PHP

$reader = new XMLReader();

if (! $reader->open('feed.xml')) {
    throw new RuntimeException('Cannot open XML file.');
}

$rows = [];

while ($reader->read()) {
    if (
        $reader->nodeType !== XMLReader::ELEMENT
        || $reader->name !== 'offer'
    ) {
        continue;
    }

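    // Serialize only the current <offer> subtree, then parse that small
    // fragment with SimpleXML; the rest of the file is never held in memory.
    $offerXml = $reader->readOuterXML();
    $offer = simplexml_load_string($offerXml);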

    if ($offer === false) {
        continue;
    }

    $rows[] = [
        'id' => (string) $offer['id'],
        'available' => ((string) $offer['available']) === 'true',
        'name' => (string) $offer->name,
        'price' => (float) $offer->price,
        'currency' => (string) $offer->price['currency'],
    ];
}

$reader->close();

var_export($rows);
  

Output:

    PHP

array (
  0 =>
  array (
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.9,
    'currency' => 'USD',
  ),
  1 =>
  array (
    'id' => '1002',
    'available' => false,
    'name' => 'Mouse',
    'price' => 19.9,
    'currency' => 'USD',
  ),
)
  

This is a perfectly valid solution.

It is memory-safe in the important sense: we are not loading the whole XML document into memory. We are moving through the stream and extracting matching nodes.

For a one-off task, this may be enough. But there are tradeoffs.

What the Raw XMLReader Version Costs You

The raw XMLReader version works, but its cost is not obvious when the example is this small. The real cost shows up later:

Matching logic has to be repeated or abstracted;
Field extraction rules are embedded directly in the loop;
Nested XML handling becomes more verbose;
Attributes and text values require repeated manual decisions;
Optional fields quickly add conditionals;
The same extraction pattern gets reimplemented across projects.

This is the critical point: the issue is not whether `XMLReader` is capable. It absolutely is.

The issue is whether low-level cursor code is the right place to keep business extraction logic once the project grows beyond a toy example.
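The first cost in the list above, matching logic that has to be repeated or abstracted, can be sketched with plain `XMLReader`. This is a minimal illustration, not code from XmlExtractKit: the helper name `streamElements()` is hypothetical, and the inline XML stands in for a real feed file. It assumes matching elements are never nested inside each other, so `XMLReader::next()` can be used to skip a subtree once it has been parsed.

```php
<?php

/**
 * Hypothetical helper (not part of XMLReader or XmlExtractKit):
 * yields every element with the given name as a SimpleXMLElement.
 */
function streamElements(XMLReader $reader, string $name): Generator
{
    // Move to the first node; an empty stream ends the loop immediately.
    $hasNode = $reader->read();

    while ($hasNode) {
        $isMatch = $reader->nodeType === XMLReader::ELEMENT
            && $reader->name === $name;

        if (! $isMatch) {
            $hasNode = $reader->read();
            continue;
        }

        // Parse only the matching subtree, never the whole document.
        $fragment = simplexml_load_string($reader->readOuterXml());

        if ($fragment !== false) {
            yield $fragment;
        }

        // next() jumps past the subtree we just parsed instead of re-reading it.
        $hasNode = $reader->next();
    }
}

$xml = <<<'XML'
<catalog>
  <offer id="1001"><name>Keyboard</name></offer>
  <service id="s-1"><name>Warranty</name></service>
  <offer id="1002"><name>Mouse</name></offer>
</catalog>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$names = [];

foreach (streamElements($reader, 'offer') as $offer) {
    $names[] = (string) $offer->name;
}

$reader->close();

var_export($names);
```

Once a helper like this exists, every new feed needs only a mapping step. At that point you are already maintaining a small extraction layer of your own, which is the tradeoff this article is about.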

Option 2: XmlExtractKit on Top of XMLReader

Now, let us solve the same extraction task using XmlExtractKit.

The important thing to understand is that the streaming model does not change. Under the hood, the workflow is still based on XMLReader.

What changes is the level of abstraction. Instead of manually managing cursor flow and converting node fragments inline, the library lets me say:

Stream through the XML
Select matching nodes
Receive structured PHP arrays for those nodes

Here is the same scenario using FastXmlParser::extractHierarchy() and XmlElement:

    PHP

use SbWereWolf\XmlNavigator\Navigation\XmlElement;
use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;

require_once __DIR__ . '/vendor/autoload.php';

$reader = new XMLReader();

if (! $reader->open('feed.xml')) {
    throw new RuntimeException('Cannot open XML file.');
}

$rows = [];

foreach (
    FastXmlParser::extractHierarchy(
        $reader,
        static fn (XMLReader $cursor): bool =>
            $cursor->nodeType === XMLReader::ELEMENT
            && $cursor->name === 'offer'
    ) as $offerData
) {
    $offer = new XmlElement($offerData);
    $name = $offer->pull('name')->current();
    $price = $offer->pull('price')->current();

    $rows[] = [
        'id' => $offer->get('id'),
        'available' => $offer->get('available') === 'true',
        'name' => $name?->value() ?? '',
        'price' => (float) ($price?->value() ?? 0),
        'currency' => $price?->get('currency') ?? '',
    ];
}

$reader->close();

var_export($rows);
  

The result is the same kind of application-level array:

    PHP

array (
  0 =>
  array (
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.9,
    'currency' => 'USD',
  ),
  1 =>
  array (
    'id' => '1002',
    'available' => false,
    'name' => 'Mouse',
    'price' => 19.9,
    'currency' => 'USD',
  ),
)
  

That is the key comparison.

Both approaches are streaming-based. Both avoid loading the full XML document into memory. Both can solve the same extraction task.

The difference is where the complexity lives.

The Practical Difference

With raw XMLReader, the extraction loop carries several responsibilities at once:

Traversal
Node matching
Fragment parsing
Data mapping
Shape normalization

With XmlExtractKit, traversal remains streaming-based, but extraction becomes more explicit and reusable.

That matters because most XML integration code is not judged only by whether it works today. It is judged by what happens when you need to:

Add another field
Support optional nodes
Process another repeated element type
Reuse the same extraction pattern in a second project
Hand the code to someone else six months later

In other words, the comparison is not just about performance. It is about where you want complexity to accumulate.
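To make the "support optional nodes" case concrete, here is a minimal sketch of what the mapping step tends to look like once a field can be missing. It uses only SimpleXML. The function name `mapOffer()` and the choice of `null` for absent values are assumptions made for illustration, not something either approach prescribes.

```php
<?php

/**
 * Hypothetical mapper: turns one <offer> fragment into a plain array.
 * A missing <price> node or currency attribute becomes null.
 */
function mapOffer(SimpleXMLElement $offer): array
{
    // isset() on a SimpleXML property checks whether the child element exists.
    $hasPrice = isset($offer->price);

    return [
        'id' => (string) $offer['id'],
        'available' => ((string) $offer['available']) === 'true',
        'name' => (string) $offer->name,
        'price' => $hasPrice ? (float) $offer->price : null,
        'currency' => $hasPrice && isset($offer->price['currency'])
            ? (string) $offer->price['currency']
            : null,
    ];
}

$complete = simplexml_load_string(
    '<offer id="1001" available="true"><name>Keyboard</name>'
    . '<price currency="USD">49.90</price></offer>'
);

$withoutPrice = simplexml_load_string(
    '<offer id="1003" available="false"><name>Cable</name></offer>'
);

$rows = [mapOffer($complete), mapOffer($withoutPrice)];

var_export($rows);
```

Each optional field adds one more existence check like the ones above. Keeping those checks in a dedicated mapper, rather than inside the cursor loop, is one way to stop traversal and data mapping from growing together.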

What Raw XMLReader Is Still Excellent For

It is worth being very clear here: this is not an argument against XMLReader.

XMLReader is the right foundation for large XML handling in PHP.

And there are cases where staying close to the metal is still the best option:

The task is small and one-off
You need very custom cursor-level logic
The extraction rules are extremely specific
Introducing another abstraction would not pay for itself

When that is the case, use raw `XMLReader` and move on. That is a completely reasonable engineering choice.

Where XmlExtractKit Starts Paying Off

A focused extraction toolkit starts making sense when the job repeats.

That usually means one or more of these are true:

XML files are large enough that streaming is mandatory
Extraction is a recurring integration pattern
The codebase needs arrays, not XML trees
Multiple projects solve similar feed or import tasks
You want a stable intermediate representation of XML records
You want the extraction code to read like the task, not like cursor choreography

That is the use case I built sbwerewolf/xml-navigator for.

I did not want a general-purpose XML mega-toolkit. I wanted a practical way to keep the memory-safe streaming model while reducing how much extraction glue code I had to keep rewriting.

A More Honest Way to Compare XML Tools

One of the reasons XML discussions become unhelpful is that people compare tools that are not aimed at the same job.

A better comparison framework looks like this:

DOM / SimpleXML when the document is small and full-tree convenience matters
Raw XMLReader when the file is large and the task is custom enough that low-level control is worth it
XmlExtractKit when the file is large, the task is extraction-focused, and you want structured arrays instead of repeated cursor glue

That is much more useful than asking for a universal winner. There is no universal winner. There is only the better fit for the task in front of you.
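For contrast, the small-document case from that framework can be sketched with SimpleXML alone. The inline XML mirrors the sample feed from earlier in the article. The whole document is parsed into memory in one call, which is exactly why this style suits small inputs and not large feeds.

```php
<?php

$xml = <<<'XML'
<catalog>
  <offer id="1001" available="true">
    <name>Keyboard</name>
    <price currency="USD">49.90</price>
  </offer>
  <service id="s-1">
    <name>Warranty</name>
  </service>
  <offer id="1002" available="false">
    <name>Mouse</name>
    <price currency="USD">19.90</price>
  </offer>
</catalog>
XML;

// Parses the entire document at once: convenient, but memory grows with file size.
$catalog = simplexml_load_string($xml);

if ($catalog === false) {
    throw new RuntimeException('Cannot parse XML.');
}

$ids = [];

// Property access iterates only the <offer> children; <service> is skipped.
foreach ($catalog->offer as $offer) {
    $ids[] = (string) $offer['id'];
}

var_export($ids);
```

There is no cursor and no matching callback here, which is the convenience full-tree access buys. The cost is that nothing can be extracted until the whole file has been loaded.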

So Which One Should You Choose?

Here is my practical answer.

Choose raw XMLReader when:

You want maximal control
The task is narrow
The extraction code will probably never be reused
A little extra boilerplate is acceptable

Choose XmlExtractKit when:

You keep solving the same extraction problem repeatedly
You want the XML stage to produce structured PHP arrays
You want extraction code that is easier to read and maintain
You want to stay streaming-first without hand-writing the same conversion patterns again and again

Conclusion

For real XML extraction tasks in PHP, the main decision is usually not "which XML package is best?"

It is this: Do I want to keep solving this at the raw XMLReader level, or do I want a reusable extraction-oriented layer on top of the same streaming model?

That is the honest comparison.

XMLReader is still the correct low-level foundation for large XML files.

But if your actual problem is repeated extraction of business records into plain PHP arrays, then XmlExtractKit (sbwerewolf/xml-navigator) is designed to make that workflow cleaner, more reusable, and easier to maintain.

Try It

    Shell

composer require sbwerewolf/xml-navigator

Explore the demo project:

    Shell

git clone https://github.com/SbWereWolf/xml-extract-kit-demo-repo.git
cd xml-extract-kit-demo-repo
composer install

Please discuss this on dev.to.


Published at DZone with permission of Nicholas Volkhin. See the original article here.
