How to Parse Large XML Files in PHP Without Running Out of Memory

Stream huge XML feeds in PHP, extract only matching nodes, and convert them into plain PHP arrays without loading the whole document.

By Nicholas Volkhin · Jun. 05, 26 · Tutorial

XML is still everywhere: supplier feeds, marketplace catalogs, partner exports, legacy APIs, SOAP-ish payloads, ETL jobs. None of that is glamorous, but plenty of production systems still depend on it.

The real problem starts when the file is no longer small.

At that point, the question is not really "How do I parse XML in PHP?" It becomes:
How do I process a large XML document safely, extract only the records I care about, and keep the rest of my application working with normal PHP data structures?

That is a very different problem.

In many real-world integrations, you do not need the whole XML document in memory. You do not need to traverse every branch of the tree. You do not need a rich DOM-style model.

You usually need something much simpler:

  • Scan the file efficiently
  • Find repeated business records such as `product`, `offer`, or `item`
  • Extract those records
  • Turn them into arrays
  • Pass them to the rest of your pipeline

That is the approach I use in modern PHP projects, and it is the one I recommend for large XML workloads.

Why Naive XML Parsing Stops Working

For small files, the usual PHP XML tools are perfectly fine. A typical first solution looks like this:

PHP
 
$xml = simplexml_load_file('feed.xml');

if ($xml === false) {
	throw new RuntimeException('Cannot parse XML file.');
}

foreach ($xml->products->product as $product) {
	// process product
}


There is nothing wrong with that when the file is small and the document structure is simple.

The trouble is that this style of code implicitly treats the XML file as something you want to load and work with as a whole. For large feeds, that is often the wrong tradeoff.

If you only need repeated business records from a large XML file, materializing the entire document in memory is unnecessary work. It also makes your pipeline more fragile as feeds grow over time.

This is why large-XML handling should start with a different mental model:

Do not load the document. Stream through it and extract only what matters.

The Real Task Is Usually Extraction, Not XML Manipulation

In practice, most XML processing jobs in application code look like this:

  • The file contains many repeated records
  • You only need a subset of them
  • You only need some fields from each record
  • The result will end up in arrays, JSON, a database, or a queue

That means the business task is usually not "work with XML as a document." It is: Find the repeated records I care about and turn them into application-friendly data.

That distinction matters because it leads directly to the right low-memory approach.

The Memory-Safe Foundation: XMLReader

In PHP, the standard low-level tool for memory-safe XML traversal is `XMLReader`.

Instead of loading the entire document, it lets you move through the XML cursor-style, node by node. That is exactly what you want when the file is large.

Here is a minimal baseline example:

PHP
 
$reader = new XMLReader();

if (! $reader->open('feed.xml')) {
	throw new RuntimeException('Cannot open XML file.');
}

while ($reader->read()) {
	if (
		$reader->nodeType === XMLReader::ELEMENT
		&& $reader->name === 'product'
	) {
		// Serialize only the current <product> subtree, not the whole file.
		$nodeXml = $reader->readOuterXML();

		$product = simplexml_load_string($nodeXml);

		if ($product === false) {
			continue; // skip a fragment that could not be parsed
		}

		$data = [
			'id' => (string) $product->id,
			'name' => (string) $product->name,
			'price' => (float) $product->price,
			'available' => (string) $product->available,
		];

		// process $data immediately
	}
}

$reader->close();


This is already much better than loading the full file up front. It gives you the right execution model:

  • Sequential reading
  • Low memory pressure
  • Immediate processing of extracted records

If your XML task is simple and one-off, this may be enough. But once you do this in more than one project, the weak points show up quickly.
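One way to make the baseline loop reusable is to wrap it in a generator. The sketch below is only an illustration: `streamRecords()` is a hypothetical helper name, not part of PHP or of any library. It also swaps in `XMLReader::next()` after each match, so the cursor skips the subtree of a record that has already been extracted instead of visiting its children one by one.

```php
<?php

/**
 * Streams every element named $elementName from the XML file at $uri.
 * Hypothetical helper for illustration; not part of any library.
 *
 * @return Generator<int, SimpleXMLElement>
 */
function streamRecords(string $uri, string $elementName): Generator
{
	$reader = new XMLReader();

	if (! $reader->open($uri)) {
		throw new RuntimeException('Cannot open XML file.');
	}

	try {
		$hasNode = $reader->read();

		while ($hasNode) {
			$isMatch = $reader->nodeType === XMLReader::ELEMENT
				&& $reader->name === $elementName;

			if (! $isMatch) {
				// Not a record: step to the next node in document order.
				$hasNode = $reader->read();
				continue;
			}

			// Only this one record is materialized as a SimpleXMLElement.
			$record = simplexml_load_string($reader->readOuterXml());

			if ($record !== false) {
				yield $record;
			}

			// Skip the record's subtree instead of walking its children.
			$hasNode = $reader->next();
		}
	} finally {
		// Runs even if the consumer stops iterating early.
		$reader->close();
	}
}
```

A caller can then write `foreach (streamRecords('feed.xml', 'product') as $product) { ... }` and handle one record at a time. Because the function is a generator, only the current record is held as a `SimpleXMLElement`.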

Where Raw XMLReader Starts to Hurt

`XMLReader` is powerful, but it is also low-level. The moment your extraction task becomes slightly more realistic, you start accumulating glue code:

  • Repeated node-selection logic
  • Conversion of XML fragments into arrays
  • Nested element handling
  • Attributes versus values
  • Optional nodes
  • Repeated fields like multiple `<picture>` tags
  • Serialization to JSON-friendly structures
  • Duplicated extraction code across projects

At that point, memory is no longer the only concern. Maintainability becomes the real cost.

This is the line I care about most in application code: not just "can I stream it," but "can I keep the extraction logic readable after the third similar integration?"
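To make that glue code concrete, here is a minimal sketch of a hand-written mapper for a single record. The function name `offerToArray()` and the `<offer>`, `<price>`, and `<picture>` layout are assumptions chosen for illustration. The sketch covers three of the concerns listed above: attributes versus values, an optional node, and a repeated field.

```php
<?php

/**
 * Maps one <offer> fragment to a plain array by hand.
 * Element and attribute names here are illustrative assumptions.
 */
function offerToArray(SimpleXMLElement $offer): array
{
	// Optional node: <price> may be missing entirely.
	$price = null;
	$currency = null;

	if (isset($offer->price)) {
		$price = (float) $offer->price;
		// Attributes are read with array syntax, values with a cast.
		$currency = isset($offer->price['currency'])
			? (string) $offer->price['currency']
			: null;
	}

	// Repeated field: collect every <picture> into a list.
	$pictures = [];
	foreach ($offer->picture as $picture) {
		$pictures[] = (string) $picture;
	}

	return [
		'id' => (string) $offer['id'],
		'name' => (string) $offer->name,
		'price' => $price,
		'currency' => $currency,
		'pictures' => $pictures,
	];
}
```

Each integration with a slightly different feed tends to need its own copy of a function like this, which is where the duplication comes from.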

A More Practical Extraction-First Approach

This is exactly why I built XmlExtractKit for PHP, published as `sbwerewolf/xml-navigator`.

The goal is not to replace `XMLReader`, but to keep its streaming model while moving application code closer to the actual business task. Instead of managing the cursor manually and assembling records by hand, I want code that says:

  • Open a large XML stream
  • Match the elements I care about
  • Get plain PHP arrays back

Here is a streaming example using the library:

PHP
 
use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;

require_once __DIR__ . '/vendor/autoload.php';

$uri = tempnam(sys_get_temp_dir(), 'xml-extract-kit-');
file_put_contents($uri, <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
	<offer id="1001" available="true">
		<name>Keyboard</name>
		<price currency="USD">49.90</price>
	</offer>
	<service id="s-1">
		<name>Warranty</name>
	</service>
	<offer id="1002" available="false">
		<name>Mouse</name>
		<price currency="USD">19.90</price>
	</offer>
</catalog>
XML);

$reader = XMLReader::open($uri);

if ($reader === false) {
	throw new RuntimeException('Cannot open XML file.');
}

$offers = FastXmlParser::extractPrettyPrint(
	$reader,
	static fn (XMLReader $cursor): bool =>
		$cursor->nodeType === XMLReader::ELEMENT
		&& $cursor->name === 'offer'
);

foreach ($offers as $offer) {
	echo json_encode(
		$offer,
		JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES
	) . PHP_EOL;
}

$reader->close();
unlink($uri);


The output is application-friendly:

JSON
 
{
	"offer": {
		"@attributes": {
			"id": "1001",
			"available": "true"
		},
		"name": "Keyboard",
		"price": {
			"@value": "49.90",
			"@attributes": {
				"currency": "USD"
			}
		}
	}
}


JSON
 
{
	"offer": {
		"@attributes": {
			"id": "1002",
			"available": "false"
		},
		"name": "Mouse",
		"price": {
			"@value": "19.90",
			"@attributes": {
				"currency": "USD"
			}
		}
	}
}


This is still a streaming workflow. The difference is that the code is now centered on the extraction task instead of low-level cursor management.

That becomes more valuable when the XML structure is nested, partially optional, or reused across multiple integrations.

Why Plain Arrays Are Often the Right Output

A lot of application code does not really want XML. It wants data.

Once the relevant record has been extracted, the rest of the system usually prefers:

  • Plain arrays
  • Normalized values
  • JSON-ready structures
  • Data that can be validated, transformed, and persisted

That is why I think "XML extraction" is a more useful framing than "XML handling."

Most business systems do not want to live inside an XML tree. They want to move past it as quickly as possible.

If the XML document is just a transport format, then the best workflow is usually:

XML stream -> selected nodes -> PHP arrays

That is the design center of my library.
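Once records arrive as arrays, normalizing them is ordinary PHP. The sketch below assumes the array shape shown in the sample output above (`@attributes` and `@value` keys, with a plain string for an element that has no attributes). `normalizeOffer()` is a hypothetical helper of my own naming, not a library function, and the target field names are illustrative.

```php
<?php

/**
 * Flattens one extracted record into a row with typed values.
 * Assumes the array shape from the sample output in this article.
 */
function normalizeOffer(array $record): array
{
	$offer = $record['offer'] ?? [];
	$attributes = $offer['@attributes'] ?? [];
	$price = $offer['price'] ?? null;

	// An element with attributes is an array with '@value';
	// an element without attributes is a plain string.
	$priceValue = is_array($price) ? ($price['@value'] ?? null) : $price;
	$currency = is_array($price)
		? ($price['@attributes']['currency'] ?? null)
		: null;

	return [
		'id' => $attributes['id'] ?? null,
		'available' => ($attributes['available'] ?? 'false') === 'true',
		'name' => $offer['name'] ?? null,
		'price' => $priceValue !== null ? (float) $priceValue : null,
		'currency' => $currency,
	];
}
```

Keeping this step separate from extraction means validation and persistence code never needs to know that the data started out as XML.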

When This Approach Makes Sense

This style of XML processing works especially well when:

  • The XML file is large
  • The document contains many repeated records
  • You only need part of the document
  • The extracted data should be processed immediately
  • The rest of the application works with arrays, not DOM objects

Typical examples include:

  • Supplier and marketplace feeds
  • Product catalogs
  • Partner imports and exports
  • ETL jobs
  • Queue payload preparation
  • Legacy integration endpoints that still speak XML
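For ETL jobs and queue payloads, extracted records are often written in small batches instead of one at a time. The following is a minimal sketch of that idea. `inBatches()` is a hypothetical helper, and the batch size is something you would tune for your own database or queue.

```php
<?php

/**
 * Groups any iterable of records into arrays of at most $size items.
 * Hypothetical helper for illustration.
 *
 * @return Generator<int, array>
 */
function inBatches(iterable $records, int $size): Generator
{
	if ($size < 1) {
		throw new InvalidArgumentException('Batch size must be at least 1.');
	}

	$batch = [];

	foreach ($records as $record) {
		$batch[] = $record;

		if (count($batch) === $size) {
			yield $batch;
			$batch = [];
		}
	}

	// Flush the final, possibly smaller, batch.
	if ($batch !== []) {
		yield $batch;
	}
}
```

Because the helper accepts any iterable, it can sit directly on top of a streaming source, and memory use stays bounded by the batch size and not by the size of the feed.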

When You Probably Do Not Need It

There are also cases where this is the wrong tool.
You probably do not need a streaming extraction approach when:

  • The XML is small
  • Loading the whole file is acceptable
  • You need full-document manipulation
  • Your task is closer to DOM transformation than record extraction
  • The XML structure is simple enough that a tiny one-off script is enough

That is important to say explicitly. Not every XML task needs an extraction-first workflow. But the ones that do usually benefit from it immediately.

A Useful Rule of Thumb

Here is the simplest practical rule I know:

  • If the XML is small and you need the whole document, convenience APIs are fine.
  • If the XML is large and you only need repeated records, stream it.
  • If you keep solving the same streaming extraction problem in multiple projects, stop writing the same glue code over and over.

That is the point where a focused library becomes worth it.

Conclusion

Large XML files are not primarily a parsing problem. They are an extraction problem.

If you treat them like full in-memory documents, you often pay too much in memory and complexity. If you treat them like streams of repeated business records, the solution becomes safer, simpler, and much easier to fit into modern PHP pipelines.

XMLReader gives you the right low-level foundation for that model.

And if your real task is not "load XML," but "extract matching records and turn them into plain PHP arrays," then XmlExtractKit (`sbwerewolf/xml-navigator`) was built exactly for that workflow.

Try It

Shell
 
composer require sbwerewolf/xml-navigator


Explore the demo project:

Shell
 
git clone https://github.com/SbWereWolf/xml-extract-kit-demo-repo.git
cd xml-extract-kit-demo-repo
composer install


Please discuss this on dev.to.


Published at DZone with permission of Nicholas Volkhin. See the original article here.

Opinions expressed by DZone contributors are their own.
