XMLReader vs XmlExtractKit for Real XML Extraction Tasks in PHP

Compare raw XMLReader with XmlExtractKit on a real extraction task: complex and repeated XML records in, plain PHP arrays out.

By Nicholas Volkhin · Jun. 08, 26 · Analysis

When PHP developers compare XML approaches, the comparison often starts in the wrong place.

It usually becomes a vague question like this: "What is the best XML library for PHP?" That is too broad to be useful.

In real projects, the question is usually much narrower:

  • I have a large XML file
  • It contains repeated business records
  • I only need some of those records
  • I want application-friendly PHP data, not a full in-memory XML tree

That is not a general XML problem. It is an extraction task.

And for this kind of work, the most honest comparison is often not between two third-party packages. It is between:

  • Raw XMLReader, where you write the extraction logic yourself;
  • A focused extraction toolkit, where the streaming model stays the same, but the glue code becomes reusable.

In my case, that focused toolkit is XmlExtractKit, published as sbwerewolf/xml-navigator. This article compares both approaches on the same practical task.

The Task

Suppose we have a large XML feed that contains repeated `<offer>` records, mixed with other node types that we do not care about.

We want to extract each offer into a PHP array with a shape like this:

PHP
 
[
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.90,
    'currency' => 'USD',
]


Here is the sample XML:

XML
 
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <offer id="1001" available="true">
    <name>Keyboard</name>
    <price currency="USD">49.90</price>
  </offer>
  <service id="s-1">
    <name>Warranty</name>
  </service>
  <offer id="1002" available="false">
    <name>Mouse</name>
    <price currency="USD">19.90</price>
  </offer>
</catalog>


This is a simple example, but it is representative of a lot of real XML integration work: repeated nodes, some attributes, some nested values, and other elements that should be ignored.

Option 1: Raw XMLReader

The low-level, memory-safe baseline in PHP is XMLReader: it reads the document as a forward-only stream instead of building a full tree. That makes it the right foundation for large-file extraction.

Here is one way to solve the task with plain `XMLReader` and a small amount of helper parsing:

PHP
 
$reader = new XMLReader();

// Open the feed as a forward-only stream.
if (! $reader->open('feed.xml')) {
    throw new RuntimeException('Cannot open XML file.');
}

$rows = [];

// Walk the document node by node; skip everything except <offer> start tags.
while ($reader->read()) {
    if (
        $reader->nodeType !== XMLReader::ELEMENT
        || $reader->name !== 'offer'
    ) {
        continue;
    }

    // Parse only the current <offer> fragment into a small SimpleXML tree.
    $offerXml = $reader->readOuterXML();
    $offer = simplexml_load_string($offerXml);

    // Skip fragments that fail to parse.
    if ($offer === false) {
        continue;
    }

    // Map attributes and child values to the target array shape.
    $rows[] = [
        'id' => (string) $offer['id'],
        'available' => ((string) $offer['available']) === 'true',
        'name' => (string) $offer->name,
        'price' => (float) $offer->price,
        'currency' => (string) $offer->price['currency'],
    ];
}

$reader->close();

var_export($rows);


Output:

PHP
 
array (
  0 =>
  array (
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.9,
    'currency' => 'USD',
  ),
  1 =>
  array (
    'id' => '1002',
    'available' => false,
    'name' => 'Mouse',
    'price' => 19.9,
    'currency' => 'USD',
  ),
)


This is a perfectly valid solution.

It is memory-safe in the important sense: we are not loading the whole XML document into memory. We move through the stream and materialize only one matching `<offer>` fragment at a time.

For a one-off task, this may be enough. But there are tradeoffs.

What the Raw XMLReader Version Costs You

The raw XMLReader version works, but its cost is not obvious when the example is this small. The real cost shows up later:

  • Matching logic has to be repeated or abstracted;
  • Field extraction rules are embedded directly in the loop;
  • Nested XML handling becomes more verbose;
  • Attributes and text values require repeated manual decisions;
  • Optional fields quickly add conditionals;
  • The same extraction pattern gets reimplemented across projects.

This is the critical point: the issue is not whether `XMLReader` is capable. It absolutely is.

The issue is whether low-level cursor code is the right place to keep business extraction logic once the project grows beyond a toy example.
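To make "repeated or abstracted" concrete, here is a minimal sketch of what abstracting the matching loop might look like with nothing but built-in extensions. The `streamElements()` helper is hypothetical (it is not part of XMLReader or of XmlExtractKit), and the inline XML is a shortened stand-in for the sample feed.

```php
<?php

/**
 * Hypothetical helper for this sketch: yields every element named $name
 * as a SimpleXMLElement, one fragment at a time.
 */
function streamElements(XMLReader $reader, string $name): Generator
{
    // Move to the first node; false means there is nothing to read.
    $hasNode = $reader->read();

    while ($hasNode) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $element = simplexml_load_string($reader->readOuterXml());
            if ($element !== false) {
                yield $element;
            }
            // next() skips the subtree that was just parsed.
            $hasNode = $reader->next();
        } else {
            $hasNode = $reader->read();
        }
    }
}

$xml = <<<'XML'
<catalog>
  <offer id="1001" available="true"><name>Keyboard</name></offer>
  <service id="s-1"><name>Warranty</name></service>
  <offer id="1002" available="false"><name>Mouse</name></offer>
</catalog>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$ids = [];
foreach (streamElements($reader, 'offer') as $offer) {
    $ids[] = (string) $offer['id'];
}

$reader->close();

var_export($ids);
```

Even this small helper is glue code that has to be written, tested, and carried from project to project, which is the cost described above.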

Option 2: XmlExtractKit on Top of XMLReader

Now, let us solve the same extraction task using XmlExtractKit.

The important thing to understand is that the streaming model does not change. Under the hood, the workflow is still based on XMLReader. 

What changes is the level of abstraction. Instead of manually managing cursor flow and converting node fragments inline, the library lets me say:

  • Stream through the XML
  • Select matching nodes
  • Receive structured PHP arrays for those nodes

Here is the same scenario using FastXmlParser::extractHierarchy() and XmlElement:

PHP
 
use SbWereWolf\XmlNavigator\Navigation\XmlElement;
use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;

require_once __DIR__ . '/vendor/autoload.php';

$reader = new XMLReader();

if (! $reader->open('feed.xml')) {
    throw new RuntimeException('Cannot open XML file.');
}

$rows = [];

foreach (
    FastXmlParser::extractHierarchy(
        $reader,
        static fn (XMLReader $cursor): bool =>
            $cursor->nodeType === XMLReader::ELEMENT
            && $cursor->name === 'offer'
    ) as $offerData
) {
    $offer = new XmlElement($offerData);
    $name = $offer->pull('name')->current();
    $price = $offer->pull('price')->current();

    $rows[] = [
        'id' => $offer->get('id'),
        'available' => $offer->get('available') === 'true',
        'name' => $name?->value() ?? '',
        'price' => (float) ($price?->value() ?? 0),
        'currency' => $price?->get('currency') ?? '',
    ];
}

$reader->close();

var_export($rows);


The result is the same kind of application-level array:

PHP
 
array (
  0 =>
  array (
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.9,
    'currency' => 'USD',
  ),
  1 =>
  array (
    'id' => '1002',
    'available' => false,
    'name' => 'Mouse',
    'price' => 19.9,
    'currency' => 'USD',
  ),
)


That is the key comparison.

Both approaches are streaming-based. Both avoid loading the full XML document into memory. Both can solve the same extraction task.

The difference is where the complexity lives.

The Practical Difference

With raw XMLReader, the extraction loop carries several responsibilities at once:

  • Traversal
  • Node matching
  • Fragment parsing
  • Data mapping
  • Shape normalization

With XmlExtractKit, traversal remains streaming-based, but extraction becomes more explicit and reusable.

That matters because most XML integration code is not judged only by whether it works today. It is judged by what happens when you need to:

  • Add another field
  • Support optional nodes
  • Process another repeated element type
  • Reuse the same extraction pattern in a second project
  • Hand the code to someone else six months later

In other words, the comparison is not just about performance. It is about where you want complexity to accumulate.
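As an illustration of the "add another field" and "support optional nodes" cases, the sketch below pulls the mapping step of the raw approach into its own function and adds one optional value. Both the `mapOffer()` function and the `<oldprice>` element are assumptions made for this example; neither appears in the sample feed.

```php
<?php

/**
 * Hypothetical mapper: turns one <offer> fragment into a plain array.
 * The optional <oldprice> element is an assumption made for this sketch.
 */
function mapOffer(SimpleXMLElement $offer): array
{
    return [
        'id' => (string) $offer['id'],
        'available' => ((string) $offer['available']) === 'true',
        'name' => (string) $offer->name,
        'price' => (float) $offer->price,
        'currency' => (string) $offer->price['currency'],
        // Optional node: null when the element is absent.
        'old_price' => isset($offer->oldprice) ? (float) $offer->oldprice : null,
    ];
}

$withOldPrice = simplexml_load_string(
    '<offer id="1001" available="true"><name>Keyboard</name>'
    . '<price currency="USD">49.90</price><oldprice>59.90</oldprice></offer>'
);
$withoutOldPrice = simplexml_load_string(
    '<offer id="1002" available="false"><name>Mouse</name>'
    . '<price currency="USD">19.90</price></offer>'
);

$first = mapOffer($withOldPrice);
$second = mapOffer($withoutOldPrice);

var_export([$first['old_price'], $second['old_price']]);
```

Each new optional field adds another conditional of this kind, so keeping the mapping separate from the traversal loop is one way to stop the loop from growing with every change.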

What Raw XMLReader Is Still Excellent For

It is worth being very clear here: this is not an argument against XMLReader.

XMLReader is the right foundation for large XML handling in PHP.

And there are cases where staying close to the metal is still the best option:

  • The task is small and one-off
  • You need very custom cursor-level logic
  • The extraction rules are extremely specific
  • Introducing another abstraction would not pay for itself

When that is the case, use raw `XMLReader` and move on. That is a completely reasonable engineering choice.
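A small, one-off job of that kind might look like the sketch below: counting elements by name with nothing but the cursor. The inline XML is a shortened stand-in for the sample feed, used here only so the example is self-contained.

```php
<?php

$xml = '<catalog><offer id="1001"/><service id="s-1"/><offer id="1002"/></catalog>';

$reader = new XMLReader();
$reader->XML($xml);

// Count element start tags by name; no fragment is ever materialized.
$counts = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $counts[$reader->name] = ($counts[$reader->name] ?? 0) + 1;
    }
}

$reader->close();

var_export($counts);
```

There is no mapping and no shape to normalize here, so an extra abstraction layer would add little.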

Where XmlExtractKit Starts Paying Off

A focused extraction toolkit starts making sense when the job repeats.

That usually means one or more of these are true:

  • XML files are large enough that streaming is mandatory
  • Extraction is a recurring integration pattern
  • The codebase needs arrays, not XML trees
  • Multiple projects solve similar feed or import tasks
  • You want a stable intermediate representation of XML records
  • You want the extraction code to read like the task, not like cursor choreography

That is the use case I built sbwerewolf/xml-navigator for.

I did not want a general-purpose XML mega-toolkit. I wanted a practical way to keep the memory-safe streaming model while reducing how much extraction glue code I had to keep rewriting.

A More Honest Way to Compare XML Tools

One of the reasons XML discussions become unhelpful is that people compare tools that are not aimed at the same job.

A better comparison framework looks like this:

  • DOM / SimpleXML when the document is small and full-tree convenience matters
  • Raw XMLReader when the file is large, and the task is custom enough that low-level control is worth it
  • XmlExtractKit when the file is large, the task is extraction-focused, and you want structured arrays instead of repeated cursor glue

That is much more useful than asking for a universal winner. There is no universal winner. There is only a better fit for the task in front of you.
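For contrast with the streaming examples, here is a minimal sketch of the first row of that framework: a small document loaded as a full tree with SimpleXML. It reuses the sample feed inline and assumes the whole document comfortably fits in memory.

```php
<?php

$xml = <<<'XML'
<catalog>
  <offer id="1001" available="true">
    <name>Keyboard</name>
    <price currency="USD">49.90</price>
  </offer>
  <service id="s-1">
    <name>Warranty</name>
  </service>
  <offer id="1002" available="false">
    <name>Mouse</name>
    <price currency="USD">19.90</price>
  </offer>
</catalog>
XML;

// Loads the whole document into memory: suitable for small inputs only.
$catalog = simplexml_load_string($xml);

if ($catalog === false) {
    throw new RuntimeException('Cannot parse XML.');
}

// Property access selects only the <offer> children; <service> is ignored.
$names = [];
foreach ($catalog->offer as $offer) {
    $names[] = (string) $offer->name;
}

var_export($names);
```

The code is shorter than either streaming version, but memory use grows with the size of the document, which is why this option only fits the small-file case.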

So Which One Should You Choose?

Here is my practical answer.

Choose raw XMLReader when:

  • You want maximal control
  • The task is narrow
  • The extraction code will probably never be reused
  • A little extra boilerplate is acceptable

Choose XmlExtractKit when:

  • You keep solving the same extraction problem repeatedly
  • You want the XML stage to produce structured PHP arrays
  • You want extraction code that is easier to read and maintain
  • You want to stay streaming-first without hand-writing the same conversion patterns again and again

Conclusion

For real XML extraction tasks in PHP, the main decision is usually not "which XML package is best?"

It is this: Do I want to keep solving this at the raw XMLReader level, or do I want a reusable extraction-oriented layer on top of the same streaming model?

That is the honest comparison.

XMLReader is still the correct low-level foundation for large XML files.

But if your actual problem is repeated extraction of business records into plain PHP arrays, then XmlExtractKit (sbwerewolf/xml-navigator) is designed to make that workflow cleaner, more reusable, and easier to maintain.

Try It

Shell
 
composer require sbwerewolf/xml-navigator


Explore the demo project:

Shell
 
git clone https://github.com/SbWereWolf/xml-extract-kit-demo-repo.git
cd xml-extract-kit-demo-repo
composer install


Please discuss this on dev.to.


Published at DZone with permission of Nicholas Volkhin. See the original article here.

Opinions expressed by DZone contributors are their own.
