How to Parse Large XML Files in PHP Without Running Out of Memory

Stream huge XML feeds in PHP, extract only matching nodes, and convert them into plain PHP arrays without loading the whole document.

Nicholas Volkhin

Jun. 05, 26 · Tutorial

XML is still everywhere: supplier feeds, marketplace catalogs, partner exports, legacy APIs, SOAP-ish payloads, ETL jobs. None of that is glamorous, but plenty of production systems still depend on it.

The real problem starts when the file is no longer small.

At that point, the question is not really "How do I parse XML in PHP?" It becomes:
How do I process a large XML document safely, extract only the records I care about, and keep the rest of my application working with normal PHP data structures?

That is a very different problem.

In many real-world integrations, you do not need the whole XML document in memory. You do not need to traverse every branch of the tree. You do not need a rich DOM-style model.

You usually need something much simpler:

Scan the file efficiently
Find repeated business records such as `product`, `offer`, or `item`
Extract those records
Turn them into arrays
Pass them to the rest of your pipeline

That is the approach I use in modern PHP projects, and it is the one I recommend for large XML workloads.

Why Naive XML Parsing Stops Working

For small files, the usual PHP XML tools are perfectly fine. A typical first solution looks like this:

    PHP

    $xml = simplexml_load_file('feed.xml');

    foreach ($xml->products->product as $product) {
        // process product
    }

There is nothing wrong with that when the file is small and the document structure is simple.

The trouble is that this style of code implicitly treats the XML file as something you want to load and work with as a whole. For large feeds, that is often the wrong tradeoff.

If you only need repeated business records from a large XML file, materializing the entire document in memory is unnecessary work. It also makes your pipeline more fragile as feeds grow over time.

This is why large-XML handling should start with a different mental model:

Do not load the document. Stream through it and extract only what matters.

The Real Task Is Usually Extraction, Not XML Manipulation

In practice, most XML processing jobs in application code look like this:

The file contains many repeated records
You only need a subset of them
You only need some fields from each record
The result will end up in arrays, JSON, a database, or a queue

That means the business task is usually not "work with XML as a document." It is: Find the repeated records I care about and turn them into application-friendly data.

That distinction matters because it leads directly to the right low-memory approach.

The Memory-Safe Foundation: XMLReader

In PHP, the standard low-level tool for memory-safe XML traversal is `XMLReader`.

Instead of loading the entire document, it lets you move through the XML cursor-style, node by node. That is exactly what you want when the file is large.

Here is a minimal baseline example:

    PHP

    $reader = new XMLReader();

    if (! $reader->open('feed.xml')) {
        throw new RuntimeException('Cannot open XML file.');
    }

    while ($reader->read()) {
        // React only to opening <product> tags; skip every other node.
        if (
            $reader->nodeType === XMLReader::ELEMENT
            && $reader->name === 'product'
        ) {
            // Serialize just this element and parse that small fragment.
            $nodeXml = $reader->readOuterXml();

            $product = simplexml_load_string($nodeXml);

            if ($product === false) {
                continue; // skip a fragment that could not be parsed
            }

            $data = [
                'id' => (string) $product->id,
                'name' => (string) $product->name,
                'price' => (float) $product->price,
                'available' => (string) $product->available,
            ];

            // process $data immediately
        }
    }

    $reader->close();

This is already much better than loading the full file up front. It gives you the right execution model:

Sequential reading
Low memory pressure
Immediate processing of extracted records
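
The same loop can be wrapped in a generator so that calling code sees a plain `foreach` over records. The following is a minimal sketch, not a definitive implementation: `streamElements()` is a hypothetical helper name, and it assumes the target elements are not nested inside each other. It uses `XMLReader::next()` after each match so the cursor skips the subtree that was just extracted.

```php
<?php

declare(strict_types=1);

/**
 * Lazily yield every element named $elementName as a SimpleXMLElement.
 * streamElements() is a hypothetical helper, not a built-in PHP function.
 */
function streamElements(string $uri, string $elementName): Generator
{
    $reader = new XMLReader();

    if (! $reader->open($uri)) {
        throw new RuntimeException("Cannot open XML file: {$uri}");
    }

    try {
        $hasNode = $reader->read();

        while ($hasNode) {
            $isMatch = $reader->nodeType === XMLReader::ELEMENT
                && $reader->name === $elementName;

            if (! $isMatch) {
                $hasNode = $reader->read();
                continue;
            }

            $node = simplexml_load_string($reader->readOuterXml());

            if ($node !== false) {
                yield $node;
            }

            // next() jumps past the subtree that was just extracted, so the
            // cursor lands on the following node without visiting children.
            $hasNode = $reader->next();
        }
    } finally {
        // Runs when iteration finishes or the generator is discarded.
        $reader->close();
    }
}

// Usage with a small temporary file standing in for a large feed.
$uri = tempnam(sys_get_temp_dir(), 'feed-');
file_put_contents(
    $uri,
    '<products><product><name>Keyboard</name></product>'
    . '<product><name>Mouse</name></product></products>'
);

$names = [];
foreach (streamElements($uri, 'product') as $product) {
    $names[] = (string) $product->name;
}

unlink($uri);
```

Only one record is held in memory at a time, and the `finally` block closes the reader even if the consumer stops iterating early.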

If your XML task is simple and one-off, this may be enough. But once you do this in more than one project, the weak points show up quickly.

Where Raw XMLReader Starts to Hurt

XMLReader is powerful, but it is also low-level. The moment your extraction task becomes slightly more realistic, you start accumulating glue code:

Repeated node-selection logic
Conversion of XML fragments into arrays
Nested element handling
Attributes versus values
Optional nodes
Repeated fields like multiple `<picture>` tags
Serialization to JSON-friendly structures
Duplicated extraction code across projects

At that point, memory is no longer the only concern. Maintainability becomes the real cost.

This is the line I care about most in application code: not just "can I stream it," but "can I keep the extraction logic readable after the third similar integration?"
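
To make that glue code concrete, here is a rough sketch of the kind of hand-written converter these integrations tend to accumulate. It is an illustration only, not the implementation of any library: `elementToArray()` is a hypothetical helper, and the `@attributes` and `@value` key names are assumptions chosen for the example.

```php
<?php

declare(strict_types=1);

/**
 * Convert one XML fragment into a nested array (PHP 8.0+).
 * Hypothetical helper for illustration only.
 */
function elementToArray(SimpleXMLElement $element): array|string
{
    $result = [];

    foreach ($element->attributes() as $name => $value) {
        $result['@attributes'][$name] = (string) $value;
    }

    // Group children by tag name so repeated tags become lists.
    $grouped = [];
    foreach ($element->children() as $name => $child) {
        $grouped[$name][] = elementToArray($child);
    }

    foreach ($grouped as $name => $items) {
        $result[$name] = count($items) === 1 ? $items[0] : $items;
    }

    $text = trim((string) $element);

    // A leaf without attributes collapses to its text content.
    if ($result === []) {
        return $text;
    }

    // A leaf with attributes keeps its text under a separate key.
    if ($grouped === [] && $text !== '') {
        $result['@value'] = $text;
    }

    return $result;
}

$xml = <<<'XML'
<offer id="1001">
    <name>Keyboard</name>
    <price currency="USD">49.90</price>
    <picture>a.jpg</picture>
    <picture>b.jpg</picture>
</offer>
XML;

$offer = elementToArray(simplexml_load_string($xml));
```

Even this small version has a trap: a record with one `<picture>` yields a string, while a record with two yields a list, so every consumer has to handle both shapes. That is the sort of detail that gets re-solved, slightly differently, in each project.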

A More Practical Extraction-First Approach

This is exactly why I built XmlExtractKit for PHP, published as `sbwerewolf/xml-navigator`.

The goal is not to replace `XMLReader`, but to keep its streaming model while moving application code closer to the actual business task. Instead of managing the cursor manually and assembling records by hand, I want code that says:

Open a large XML stream
Match the elements I care about
Get plain PHP arrays back

Here is a streaming example using the library:

    PHP

    use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;

    require_once __DIR__ . '/vendor/autoload.php';

    $uri = tempnam(sys_get_temp_dir(), 'xml-extract-kit-');
    file_put_contents($uri, <<<'XML'
    <?xml version="1.0" encoding="UTF-8"?>
    <catalog>
        <offer id="1001" available="true">
            <name>Keyboard</name>
            <price currency="USD">49.90</price>
        </offer>
        <service id="s-1">
            <name>Warranty</name>
        </service>
        <offer id="1002" available="false">
            <name>Mouse</name>
            <price currency="USD">19.90</price>
        </offer>
    </catalog>
    XML);

    $reader = XMLReader::open($uri);

    if ($reader === false) {
        throw new RuntimeException('Cannot open XML file.');
    }

    $offers = FastXmlParser::extractPrettyPrint(
        $reader,
        static fn (XMLReader $cursor): bool =>
            $cursor->nodeType === XMLReader::ELEMENT
            && $cursor->name === 'offer'
    );

    foreach ($offers as $offer) {
        echo json_encode(
            $offer,
            JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES
        ) . PHP_EOL;
    }

    $reader->close();
    unlink($uri);

The output is application-friendly:

    JSON

    {
        "offer": {
            "@attributes": {
                "id": "1001",
                "available": "true"
            },
            "name": "Keyboard",
            "price": {
                "@value": "49.90",
                "@attributes": {
                    "currency": "USD"
                }
            }
        }
    }

    JSON

    {
        "offer": {
            "@attributes": {
                "id": "1002",
                "available": "false"
            },
            "name": "Mouse",
            "price": {
                "@value": "19.90",
                "@attributes": {
                    "currency": "USD"
                }
            }
        }
    }

This is still a streaming workflow. The difference is that the code is now centered on the extraction task instead of low-level cursor management.

That becomes more valuable when the XML structure is nested, partially optional, or reused across multiple integrations.

Why Plain Arrays Are Often the Right Output

A lot of application code does not really want XML. It wants data.

Once the relevant record has been extracted, the rest of the system usually prefers:

Plain arrays
Normalized values
JSON-ready structures
Data that can be validated, transformed, and persisted

That is why I think "XML extraction" is a more useful framing than "XML handling."

Most business systems do not want to live inside an XML tree. They want to move past it as quickly as possible.

If the XML document is just a transport format, then the best workflow is usually:

XML stream -> selected nodes -> PHP arrays

That is the design center of my library.
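
Once a record is a plain array, the last step of that workflow is ordinary PHP. The sketch below assumes the array shape shown in the JSON output earlier in this article; `normalizeOffer()` and the flat field names it returns are hypothetical and only illustrate the idea of turning string-typed XML values into typed application data.

```php
<?php

declare(strict_types=1);

/**
 * Flatten one extracted offer into typed fields.
 * Hypothetical helper; the input shape mirrors the JSON shown above.
 */
function normalizeOffer(array $record): array
{
    $offer = $record['offer'] ?? [];
    $attributes = $offer['@attributes'] ?? [];
    $price = $offer['price'] ?? null;

    // price may be a plain string or an array carrying attributes.
    $amount = is_array($price) ? ($price['@value'] ?? null) : $price;
    $currency = is_array($price)
        ? ($price['@attributes']['currency'] ?? null)
        : null;

    return [
        'id' => (string) ($attributes['id'] ?? ''),
        'available' => ($attributes['available'] ?? 'false') === 'true',
        'name' => (string) ($offer['name'] ?? ''),
        'price' => $amount === null ? null : (float) $amount,
        'currency' => $currency,
    ];
}

$row = normalizeOffer([
    'offer' => [
        '@attributes' => ['id' => '1001', 'available' => 'true'],
        'name' => 'Keyboard',
        'price' => [
            '@value' => '49.90',
            '@attributes' => ['currency' => 'USD'],
        ],
    ],
]);
```

Missing optional nodes fall back to `null` or an empty string here; a real pipeline would more likely validate and reject incomplete records.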

When This Approach Makes Sense

This style of XML processing works especially well when:

The XML file is large
The document contains many repeated records
You only need part of the document
The extracted data should be processed immediately
The rest of the application works with arrays, not DOM objects
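
When the downstream sink prefers bulk writes, such as multi-row database inserts or queue publishes, immediate processing can still be kept memory-bounded by grouping the stream into small batches. A minimal sketch, assuming any iterable source of records; `inBatches()` is a hypothetical helper name.

```php
<?php

declare(strict_types=1);

/**
 * Group a stream of records into arrays of at most $size items.
 * Hypothetical helper; works with any iterable, including generators.
 */
function inBatches(iterable $records, int $size): Generator
{
    if ($size < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }

    $batch = [];

    foreach ($records as $record) {
        $batch[] = $record;

        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }

    // Emit the final, possibly smaller, batch.
    if ($batch !== []) {
        yield $batch;
    }
}

// Usage: five records in batches of two.
$batches = iterator_to_array(
    inBatches(new ArrayIterator([1, 2, 3, 4, 5]), 2),
    false
);
```

At most `$size` records are held at once, so memory use stays flat regardless of how large the feed is.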

Typical examples include:

Supplier and marketplace feeds
Product catalogs
Partner imports and exports
ETL jobs
Queue payload preparation
Legacy integration endpoints that still speak XML

When You Probably Do Not Need It

There are also cases where this is the wrong tool.
You probably do not need a streaming extraction approach when:

The XML is small
Loading the whole file is acceptable
You need full-document manipulation
Your task is closer to DOM transformation than record extraction
The XML structure is simple enough that a tiny one-off script is enough

That is important to say explicitly. Not every XML task needs an extraction-first workflow. But the ones that do usually benefit from it immediately.

A Useful Rule of Thumb

Here is the simplest practical rule I know:

If the XML is small and you need the whole document, convenience APIs are fine.
If the XML is large and you only need repeated records, stream it.
If you keep solving the same streaming extraction problem in multiple projects, stop writing the same glue code over and over.

That is the point where a focused library becomes worth it.

Conclusion

Large XML files are not primarily a parsing problem. They are an extraction problem.

If you treat them like full in-memory documents, you often pay too much in memory and complexity. If you treat them like streams of repeated business records, the solution becomes safer, simpler, and much easier to fit into modern PHP pipelines.

XMLReader gives you the right low-level foundation for that model.

And if your real task is not "load XML," but "extract matching records and turn them into plain PHP arrays," then XmlExtractKit (`sbwerewolf/xml-navigator`) was built exactly for that workflow.

Try It

    Shell

    composer require sbwerewolf/xml-navigator

Explore the demo project:

    Shell

    git clone https://github.com/SbWereWolf/xml-extract-kit-demo-repo.git
    cd xml-extract-kit-demo-repo
    composer install

Please discuss this on dev.to.

Published at DZone with permission of Nicholas Volkhin. See the original article here.

Opinions expressed by DZone contributors are their own.
