simdxml: structural indexing for XML

Applying simdjson's ideas to XML parsing. It's complicated, but it works.

At Amplified, we work with patent data. A lot of it. Patent documents worldwide are published in WIPO's ST.36 standard, which means XML. Deeply nested, attribute-heavy, namespace-laden XML. Our data warehouse is fundamentally a collection of these documents, and we're constantly querying into them to extract claims, titles, classifications, and citations.

We've been experimenting with DuckDB recently as a replacement for some of our heavier infrastructure, and I've been spending time with Arrow and SIMD techniques in Rust. At some point the obvious question came up: simdjson showed that SIMD can radically speed up JSON parsing by building a structural index in a single vectorised pass. Could the same ideas work for XML?

The short answer: it's complicated, but yes. The result is simdxml, a Rust library that applies SIMD structural indexing to XML and integrates XPath evaluation directly on the resulting flat arrays. It looks like it's faster than quick-xml for parsing (despite also building a full XPath-ready index), and it matches or beats pugixml on end-to-end XPath queries. Its biggest wins are on large files and attribute-dense documents. It falls behind on some query patterns too, and I'll get into where and why.

This post walks through the architecture, the chain of optimisations we found along the way, and the benchmarks. I'll get into some of the maths where it's interesting.

Why XML is harder than JSON

simdjson's core insight is that you can classify structural characters ({, }, [, ], :, ,) in bulk using SIMD, producing a flat tape of positions that a second pass interprets. JSON's structural characters are single bytes with relatively simple quoting rules.

XML is messier. Tag names must be parsed and matched, not just delimited. Attributes create structure within tags, not just between them. Self-closing tags need detection. And quoted attribute values can contain angle brackets, meaning you can't just scan for < and > without understanding quoting context.

The quoting problem is the most interesting one. In JSON, you only have double quotes, and simdjson handles them with a clever prefix-XOR trick. XML has both single and double quotes in attribute values, and they can be nested. There's also the matter of CDATA sections, comments, and processing instructions, none of which exist in JSON.

So the direct port isn't straightforward. But the core idea (SIMD classification of structural characters into bitmasks, followed by a sequential index-building pass) does transfer. It just takes more work.

The two-pass architecture

simdxml uses a two-pass approach:

Pass 1 processes the input in 64-byte chunks using SIMD instructions. On each chunk, it classifies every byte, producing bitmasks for <, >, /, =, ", and ' positions. On AArch64 this uses NEON VCEQ comparisons across four 16-byte vectors per iteration; on x86_64 it uses AVX2 PCMPEQB on two 32-byte vectors, or SSE4.2 as a fallback.

Pass 2 is a sequential scan over those bitmasks. It walks the structural positions and builds flat arrays encoding the document tree: tag byte-offsets, interned tag names, depths, parent pointers, and text ranges. The output is a struct-of-arrays layout using roughly 16 bytes per tag, compared to pugixml's approximately 35 bytes per DOM node.

That second number matters when your XML files are 5 GB.
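What Pass 1 produces is easy to model in scalar code. A sketch (the real classifier does this with NEON/AVX2 byte-compares, 64 bytes at a time, but the output is the same set of bitmasks):

```python
STRUCTURAL = b'<>/="\''

def classify_chunk(chunk: bytes) -> dict[bytes, int]:
    """Scalar stand-in for the SIMD classifier: one bitmask per structural
    character, with bit i set iff chunk[i] equals that character."""
    masks = {bytes([c]): 0 for c in STRUCTURAL}
    for i, b in enumerate(chunk):
        key = bytes([b])
        if key in masks:
            masks[key] |= 1 << i
    return masks

masks = classify_chunk(b'<doc id="1"><item/></doc>')
```

Pass 2 never touches raw bytes between structural positions; it walks these bitmasks with count-trailing-zeros loops.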

Quote masking with carry-less multiply

The trickiest part of Pass 1 is figuring out which structural characters are inside quoted attribute values and should therefore be ignored.

simdjson solves this for JSON using a prefix-XOR on the double-quote bitmask. The idea: carry-less multiplying the bitmask of quote positions by an all-ones constant computes, at each bit position, the XOR of all quote bits up to that position, which leaves 1-bits in every position between matching quotes. AND your angle-bracket bitmask with the complement of that mask and you've masked out all quoted regions in one shot.

Concretely, let \(q\) be the bitmask of double-quote positions within a 64-byte chunk. The carry-less product

$$q \otimes \texttt{0xFFFF\ldots} = \operatorname{prefix\text{-}XOR}(q)$$

produces a mask where bits between matching quotes are set. This maps to a single instruction: PCLMULQDQ on x86 or PMULL on ARM NEON.

For XML, there's a complication: both " and ' are valid attribute delimiters, and they can appear inside each other's values. The common case (only " appears, which holds for the vast majority of 64-byte chunks in real XML) is handled identically to simdjson. When both " and ' appear in the same chunk, simdxml falls back to sequential state tracking for that chunk. In practice this fallback fires on less than 1% of chunks across all evaluated datasets. The cost is negligible.
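The trick is easy to model in scalar code. A sketch of the prefix-XOR mask and how it strips quoted angle brackets (the hardware version is a single carry-less multiply; this loop computes the same mask bit by bit):

```python
def prefix_xor_mask(chunk: bytes, quote: int = ord('"')) -> int:
    """Bit i is set iff an odd number of quotes occur at positions <= i,
    i.e. position i is inside (or opens) a quoted region."""
    inside, mask = 0, 0
    for i, b in enumerate(chunk):
        if b == quote:
            inside ^= 1          # each quote flips the in-quote state
        if inside:
            mask |= 1 << i
    return mask

chunk = b'<tag attr="a<b">'
lt_bits = sum(1 << i for i, b in enumerate(chunk) if b == ord('<'))
real_lt = lt_bits & ~prefix_xor_mask(chunk)  # drops the '<' inside the quotes
```

Here `real_lt` keeps only bit 0 (the real opening `<`); the `<` at offset 12, inside the attribute value, is masked away.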

When SIMD isn't worth it

Here's something that surprised me. Not all XML benefits from the SIMD classifier.

The two-stage pipeline (SIMD classification followed by bitmask walking) pays off when there are lots of structural characters per byte of input. Attribute-heavy XML qualifies: every attribute means more ", =, and whitespace to classify in bulk. But text-heavy XML, where you have a handful of tags wrapping large stretches of prose, doesn't benefit as much. The structural characters are sparse, and the overhead of classifying every byte is wasted work when most bytes are just text content.

The memchr crate uses SIMD internally for single-character scanning. For text-heavy XML, just scanning for < positions with memchr and parsing sequentially from each one turns out to be competitive with the full SIMD classifier. Sometimes faster.

simdxml handles this with a heuristic: at document open, it samples the first 4 KB and computes a quote ratio (the fraction of 64-byte chunks containing at least one quote character). If the ratio is below 0.05, it skips the SIMD pipeline entirely and uses the memchr-based path.

The threshold is not sensitive. Varying it from 0.01 to 0.20 changes end-to-end query time by less than 3% across all evaluated datasets. The dispatch decision is made once and applies for the remainder of the parse.
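The dispatch might look roughly like this (a sketch; the chunking and quote test are simplified relative to the real sampler):

```python
def use_simd_path(doc: bytes, sample: int = 4096, threshold: float = 0.05) -> bool:
    """Sample the first 4 KB; return True if enough 64-byte chunks contain
    a quote character to justify the full SIMD classifier."""
    head = doc[:sample]
    chunks = [head[i:i + 64] for i in range(0, len(head), 64)]
    if not chunks:
        return False
    with_quote = sum(1 for c in chunks if b'"' in c or b"'" in c)
    return with_quote / len(chunks) >= threshold
```

An attribute-dense document trips the threshold in the first chunk or two; a prose-heavy one sails under it and gets the memchr path.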

This was a useful lesson. Sometimes the best SIMD optimisation is knowing when not to use SIMD.

Flat arrays, not trees

Before talking about XPath, it's worth understanding what simdxml actually builds. A DOM parser like pugixml allocates a tree of linked nodes, each with pointers to parent, first-child, and next-sibling. This is flexible, but it means pointer chasing during traversal, per-node allocation overhead, and roughly 35 bytes per node.

simdxml builds a set of flat, densely packed arrays:

  • Tag table: arrays of (start, end, name_id, depth, pre, post, parent, flags) tuples. Each entry is a 32-byte record, but since self-closing tags, text nodes, and attributes share entries, the amortised cost comes out around 16 bytes per tag.
  • Parent vector: parent[i] gives the index of tag i's parent. Upward traversal without pointer chasing.
  • Name table: an interned string table mapping each unique tag/attribute name to an integer identifier. A reverse index maps each name ID to its sorted list of tag indices (an inverted posting list).
  • Text segments: byte ranges for text content between tags, stored as (tag_index, start, end) triples.
  • CSR adjacency: parent-child relationships in Compressed Sparse Row format. An offsets array of length n+1 and a children array. The children of tag i occupy children[offsets[i]..offsets[i+1]]. No per-node allocation, cache-friendly sequential access.

This struct-of-arrays layout is what makes everything else possible. It's cache-friendly, it fits in memory for very large documents, and it supports efficient XPath evaluation without ever building a tree.
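A sketch of how the CSR arrays can be derived from the parent vector with a counting pass and a prefix sum (simdxml's actual construction is interleaved with Pass 2; this just shows the shape of the data):

```python
def build_csr(parent: list[int]) -> tuple[list[int], list[int]]:
    """Parent vector -> CSR adjacency. parent[i] is i's parent index, or -1
    for the root. Children of i land in children[offsets[i]:offsets[i+1]]."""
    n = len(parent)
    offsets = [0] * (n + 1)
    for p in parent:
        if p >= 0:
            offsets[p + 1] += 1          # count children per node
    for i in range(n):
        offsets[i + 1] += offsets[i]     # prefix sum -> start offsets
    children = [0] * offsets[n]
    cursor = offsets[:]                   # next free slot per parent
    for i, p in enumerate(parent):
        if p >= 0:
            children[cursor[p]] = i
            cursor[p] += 1
    return offsets, children
```

One integer per edge, no per-node allocation, and child enumeration is a contiguous slice.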

XPath on flat arrays

Building a structural index is only half the problem. The index needs to support actual XPath queries.

In a DOM, evaluating XPath means traversing pointer-linked tree nodes. In simdxml, the "tree" is a set of flat arrays, and XPath axes map to array operations.

Pre/post-order numbering

Each tag receives a pre-order number (its position in document order) and a post-order number (assigned when its closing tag is encountered). This gives you constant-time ancestor and descendant checks: node \(u\) is an ancestor of node \(v\) if and only if

$$\operatorname{pre}(u) < \operatorname{pre}(v) \quad \text{and} \quad \operatorname{post}(u) > \operatorname{post}(v)$$

No tree walking required. This is a well-known technique from relational XML processing (Grust, 2002), adapted here to in-memory flat arrays.
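A minimal sketch of the numbering and the interval test:

```python
def number_tree(children: dict[int, list[int]], root: int):
    """Assign pre-order and post-order numbers by DFS over an adjacency map."""
    pre, post, counter = {}, {}, [0]
    def dfs(u):
        pre[u] = counter[0]; counter[0] += 1
        for c in children.get(u, []):
            dfs(c)
        post[u] = counter[0]; counter[0] += 1
    dfs(root)
    return pre, post

def is_ancestor(pre, post, u, v) -> bool:
    # u's (pre, post) interval strictly encloses v's iff u is an ancestor of v
    return pre[u] < pre[v] and post[u] > post[v]
```

Every ancestor/descendant axis test reduces to two integer comparisons against these arrays.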

Inverted posting lists

Each interned tag name maps to a sorted array of tag indices. A name test like //article becomes a scan over the article posting list rather than a traversal of the entire document.

Fused descendant scan

A common XPath pattern is a multi-step descendant path like //article/title. A naive evaluator would first collect all article elements (potentially millions), then for each one enumerate children named title.

simdxml recognises this at compile time and rewrites it as a single scan: iterate over the title posting list and, for each candidate, check whether its nearest ancestor named article exists using the parent vector. This avoids materialising the intermediate article node set and reduces work from \(O(|\texttt{article}| \cdot k)\) to \(O(|\texttt{title}|)\), where \(k\) is the average child count per article.

On DBLP (5.1 GB, 4.2 million articles), this optimisation is the difference between the query being feasible and it timing out.
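A sketch of the fused scan, following the description above (the posting list, parent vector, and interned name IDs are illustrative inputs; the real evaluator works on the flat arrays directly):

```python
def fused_descendant(posting: list[int], parent: list[int],
                     name_id: list[int], anc_name: int) -> list[int]:
    """One pass over the inner name's posting list: keep a candidate iff
    walking the parent vector finds an ancestor with the target name id."""
    out = []
    for t in posting:
        u = parent[t]
        while u != -1:
            if name_id[u] == anc_name:
                out.append(t)
                break
            u = parent[u]
    return out
```

The intermediate node set for the outer step is never materialised; each candidate is accepted or rejected with a short upward walk.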

Fused attribute predicates

A similar optimisation applies to patterns like //DescriptorName[@MajorTopicYN='Y']. Rather than materialising all 330K DescriptorName elements and then filtering by attribute, the attribute check is performed inline during the descendant scan. Each candidate is tested as it's found, keeping the working set small. On PubMed this brings the query from 271 ms down to 132 ms.

Parallel chunk parsing

This is the part I found most surprising. XML parsing has traditionally been considered inherently sequential. You need to track quoting state, nesting depth, and parent-child relationships, all of which depend on everything that came before.

simdjson doesn't parallelise its structural indexing stage either. For JSON, the sequential pass is fast enough. But for multi-gigabyte XML, the parse dominates wall-clock time, and I wanted to see if cores could help.

The key insight is that the SIMD first pass identifies all tag boundary positions. If you can find safe split points between complete elements (positions where a > ends one element and a < begins another), you can parse each chunk independently.

The algorithm:

  1. Find K-1 candidate split points by scanning for > characters at roughly equal intervals.
  2. For each candidate, verify it's actually between elements (not inside a comment, CDATA section, or tag). This is a short backward scan.
  3. Spawn K threads, each parsing its chunk independently. Each produces local tag tables with local offsets.
  4. A sequential merge step concatenates the results and computes global depths, parent pointers, and pre/post numbers.

The expensive work (SIMD classification and bitmask scanning) runs in parallel. The sequential merge (depth and parent computation) is O(n) and fast.
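A naive sketch of steps 1 and 2 (the real verifier's backward scan also rules out comments, CDATA sections, and quoted attribute values, which this skips):

```python
def find_split_points(doc: bytes, k: int) -> list[int]:
    """Pick up to k-1 offsets that fall between complete elements: just
    after a '>' whose next non-whitespace byte opens a new tag."""
    points = []
    step = len(doc) // k
    for i in range(1, k):
        j = doc.find(b'>', i * step)
        while j != -1:
            rest = doc[j + 1:].lstrip()
            if rest[:1] == b'<':          # next element starts here
                points.append(j + 1)
                break
            j = doc.find(b'>', j + 1)     # '>' was mid-text; keep looking
    return points
```

Each thread then parses `doc[points[i-1]:points[i]]` independently, producing local offsets that the merge step rebases.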

On the 195 MB PubMed dataset (M4 Max), an Amdahl's law decomposition gives a serial component \(s = 77\) ms (57% of sequential time) and a parallel component \(p = 59\) ms (43%). Fitting:

$$T(n) = s + \frac{p}{n} = 77 + \frac{59}{n}\;\text{ms}$$

The theoretical maximum speedup is \(1/0.57 = 1.75\times\), consistent with the observed \(1.48\times\) at 4 threads. The serial portion comprises the result merge (adjusting parent indices, computing global pre/post numbers, and building the name table).
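Plugging the fitted components back into the model reproduces the observed scaling:

```python
s, p = 77.0, 59.0          # ms, fitted on PubMed (195 MB)

def t(n: int) -> float:
    """Amdahl model: serial component plus parallel component split n ways."""
    return s + p / n

sequential = t(1)                 # 136 ms
speedup_4 = sequential / t(4)     # ~1.48x, matching the measurement
limit = sequential / s            # upper bound as n -> infinity
```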

This approach trades generality for efficiency: it requires well-formed, record-oriented XML (a root element containing many sibling records). That is exactly what ST.36 patent documents, DBLP, PubMed, and most large real-world XML corpora look like.

Chaining optimisations

One thing I enjoyed about this project was how the optimisations chain together. Each one opens the door to the next.

Bloom filter prescan

For batch workloads where you're querying thousands of documents with the same XPath expression, many documents might not contain the target element at all. simdxml optionally constructs a 128-bit Bloom filter over tag names during the first pass. Before building the full structural index, the query's name tests are checked against the filter. Documents that definitely don't contain any relevant tag name are skipped entirely.

The Bloom filter construction adds negligible overhead since tag name positions are already being extracted. With two FNV-1a hash functions on different seeds, the false positive rate is around 2% for 20 unique tag names. Prescan speed is roughly 10 GB/s, near memchr throughput.
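A sketch of the filter (the second FNV-1a seed here is invented for illustration; simdxml's actual seeds and hashing details may differ):

```python
FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a(data: bytes, seed: int) -> int:
    """64-bit FNV-1a with a configurable starting value as the 'seed'."""
    h = seed
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & MASK64
    return h

class TagBloom:
    """128-bit Bloom filter over tag names, two FNV-1a hashes."""
    SEEDS = (FNV_OFFSET, FNV_OFFSET ^ 0xA5A5A5A5A5A5A5A5)  # second seed is made up

    def __init__(self):
        self.bits = 0

    def add(self, name: bytes) -> None:
        for s in self.SEEDS:
            self.bits |= 1 << (fnv1a(name, s) & 127)

    def may_contain(self, name: bytes) -> bool:
        # False means definitely absent; True means "parse to be sure"
        return all((self.bits >> (fnv1a(name, s) & 127)) & 1 for s in self.SEEDS)
```

A `False` from `may_contain` is definitive, so the full index build is skipped for that document; a `True` just means the document proceeds to the normal path.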

Lazy index construction

Not every XPath query needs every index. A simple path like /root/child needs the CSR adjacency and name table, but not pre/post numbers. simdxml's query compiler analyses the XPath AST to determine which structural indices are required and constructs only those. For a query touching a single name at a known depth, pre/post computation can be skipped entirely, saving a full document traversal.

Persistent index files

For corpora queried repeatedly, the structural index can be serialised to an .sxi file. On subsequent queries, the index is memory-mapped directly without reparsing. For DBLP, this turns a 3.2-second parse into near-zero amortised cost.

Benchmarks

All experiments run on an Apple M4 Max (NEON) unless noted. simdxml is compiled with rustc in release mode. Baselines are pugixml 1.14 (compiled with -O2), quick-xml (Rust pull parser), and xmllint from libxml2. x86_64 benchmarks use an AMD Ryzen 9 3950X (AVX2).

Datasets

  • DBLP (5.1 GB): The DBLP computer science bibliography. ~10 million bibliographic records in a single file. Large, deep, tag-heavy.
  • PubMed (195 MB): PubMed abstracts. Mixed tags and text content, representative of biomedical data extraction workloads.
  • Attr-heavy (10 MB): Synthetic dataset with a high ratio of attributes to elements, designed to stress quote masking.

End-to-end XPath query performance

  Dataset             simdxml   pugixml   vs. pugixml   xmllint   vs. xmllint
  DBLP (5.1 GB)       3.2 s     8.3 s     2.6x          —         —
  PubMed (195 MB)     130 ms    174 ms    1.3x          831 ms    6.4x
  Attr-heavy (10 MB)  7.8 ms    13.1 ms   1.7x          138 ms    17.7x

DBLP is too large for xmllint to process in reasonable time.

On DBLP, simdxml completes //article (selecting all 4.2 million article records) in 3.2 seconds versus pugixml's 8.3. The speedup comes from three sources: faster SIMD-based parsing, lower memory overhead (eliminating allocation pressure), and the fused descendant scan.

The attribute-heavy result (17.7x over xmllint) shows the SIMD quote-masking path doing its job. The 1.7x over pugixml is consistent across datasets with dense attribute content.

Per-query breakdown on PubMed

This is where things get interesting and honest.

PubMed per-query workload (195 MB, median of 5 runs). Lower is better. simdxml wins on descendant queries (Q1, Q3) and ties or loses on aggregation (Q4) and text predicates (Q5, Q6).

simdxml wins on high-selectivity descendant queries (Q1: //MeshHeading, 1.34x) and attribute predicates (Q3: //DescriptorName[@MajorTopicYN='Y'], 1.20x), where the posting list and fused predicate evaluation beat DOM traversal.

pugixml wins on scalar aggregation (Q4: count(), 0.88x) and text predicates (Q6: contains(), 0.91x). For count, pugixml's single-pass DOM construction skips the structural merge step, making it leaner. For contains, pugixml evaluates predicates inline during its DOM walk without a separate text-extraction pass, which the flat-array architecture can't match.

The deep path (Q5: //PubmedArticle/.../Author, 5 levels) is essentially a tie. CSR child enumeration matches DOM pointer chasing.

I find the per-query breakdown more illuminating than the headline numbers. The architectural trade-offs are real. Flat arrays win on selective traversal and lose on workloads where the DOM's single-pass construction avoids redundant work.

Parse throughput

Parse-only throughput on PubMed (195 MB), M4 Max. Criterion microbenchmark. simdxml builds a full structural index; quick-xml yields pull-parser events only.

Sequential simdxml achieves 1.43 GB/s on the M4 Max, comparable to quick-xml at 1.27 GB/s. This is the important bit: simdxml's sequential throughput is in the same ballpark as a widely-used Rust pull parser despite building a full structural index (tag names, depths, parents, text ranges, CSR adjacency). quick-xml builds nothing; it just yields parse events. The index construction is essentially "free" relative to the raw parsing cost.

With 4 threads, throughput rises to 2.12 GB/s. Sub-linear scaling, as predicted by the Amdahl decomposition above.

Memory

Peak memory overhead during XPath evaluation (RSS). simdxml's flat arrays use roughly half the memory of pugixml's DOM.

On PubMed, simdxml uses roughly 16 bytes per tag versus pugixml's ~35 bytes per node. The savings come from eliminating per-node pointers (replaced by integer indices), interning tag names, and using CSR format (one integer per edge vs. pugixml's first-child/next-sibling pointer pairs).

On DBLP with 227 million tags, simdxml's index fits comfortably at ~3.6 GB. pugixml's DOM needs ~7.9 GB, which on a 16 GB machine doesn't leave much headroom for the query engine.

The DuckDB extension

Part of the motivation for simdxml was DuckDB. We've been building DuckDB-native replacements for parts of our search infrastructure at Amplified, and one piece we needed was fast XPath evaluation over XML stored in Parquet columns.

duckdb-xpath is a DuckDB extension wrapping simdxml. It provides xpath_text, xpath_list, xpath_count, xpath_exists, and xpath_eval as scalar functions, plus a read_xml table function.

The extension uses a three-tier evaluation strategy per row:

  1. Grep mode: for simple //tagname queries, a SIMD-backed QuickScanner does direct string search on raw bytes. No XML parsing at all.
  2. Bloom pre-filter: for complex queries, a Bloom filter prescan rejects documents that can't possibly match.
  3. Full parse: only when the first two tiers don't resolve. A reusable ParseScratch buffer eliminates per-row heap allocation after the first document in each DuckDB chunk.

XPath compilation happens once per chunk, not per row. The Bloom filter targets are computed once and reused. This layering means that for a 10K-document patent corpus, the gap is dramatic:

duckdb-xpath vs. libxml2-backed baseline on 10K patent documents. Log scale. Speedup labels show the gap.

The //invention-title query (100% hit rate, simple tag match) runs in 5 ms versus 3.42 seconds. That's the grep-mode fast path: SIMD string search on raw bytes, no parsing at all. As queries get more complex and the full parse path is needed, the gap narrows but remains substantial.

What it can't do (yet)

Some honesty about limitations:

  • DTDs and schemas: not processed. The parser assumes well-formed XML.
  • Text predicates at scale: as the PubMed Q6 benchmark shows, contains() over many text nodes is not a strength. pugixml evaluates predicates during DOM construction; simdxml's flat-array layout requires a separate text-extraction pass.
  • Small files: parallel parsing only pays off above ~64 KB. For small documents, the sequential path is used automatically, but the thread pool setup cost is avoided rather than amortised.
  • count() and aggregation: pugixml's single-pass DOM build has an advantage for queries that do minimal XPath work. The structural merge step in simdxml adds overhead that isn't recovered when the query itself is trivial.

Python bindings

As of today, simdxml is also available as a Python package. Pre-built wheels for Linux (x86_64, aarch64), macOS (arm64, x86_64), and Windows. pip install simdxml.

The Python API has two faces. The native one exposes parse, xpath_text, compiled queries, and the like:

import simdxml

doc = simdxml.parse(xml_bytes)
titles = doc.xpath_text("//title")

# Compiled queries for repeated use
expr = simdxml.compile("//book[@lang='en']/title")
for doc in documents:
    results = expr.eval_text(doc)

There's also a read-only ElementTree-compatible layer for drop-in use with existing code:

from simdxml.etree import ElementTree as ET

tree = ET.parse("books.xml")
root = tree.getroot()
for title in root.findall(".//title"):
    print(title.text)

# Full XPath 1.0 also available
root.xpath("//book[contains(title, 'XML')]")

Elements are immutable (it's a structural index, not a mutable tree), so you get a TypeError if you try to modify them. For read-heavy workloads, though, the numbers against lxml and stdlib are solid.

Benchmarks are on Apple Silicon, Python 3.14, lxml 6.0. GC disabled during timing, 3 warmup + 20 timed iterations, median reported. Three corpus types: data-oriented (product catalogue), document-oriented (PubMed abstracts), config-oriented (Maven POM).

Parsing

simdxml eagerly builds structural indices (CSR, name posting lists) at parse time. lxml builds a DOM tree without precomputed query indices. simdxml front-loads more work so queries are faster. Both numbers are real; the trade-off depends on your workload.

  Corpus              Size     simdxml   lxml     vs lxml
  Catalogue (data)    17 MB    32 ms     82 ms    2.6x
  PubMed (document)   17 MB    27 ms     61 ms    2.2x
  POM (config)        2.1 MB   2.7 ms    8.3 ms   3.1x

XPath queries (returning Elements)

This is the apples-to-apples comparison, both libraries returning Element objects:

XPath query benchmarks returning Elements, simdxml vs lxml. Log scale.

The predicate query (//item[@category="cat5"]) is the standout: 1.6 ms versus lxml's 69 ms. 42x. The fused attribute predicate evaluation pays off even across the Python FFI boundary. Simple descendant queries are 6-28x faster, and even on small config files it's 1.5-3.3x.

Text extraction

For ETL workloads where you just want strings, xpath_text() skips Element object creation entirely:

  Query            Corpus             simdxml   lxml     vs lxml
  //name           Catalogue 17 MB    1.8 ms    37 ms    20x
  //AbstractText   PubMed 17 MB       0.31 ms   7.1 ms   23x
  //artifactId     POM 2.1 MB         0.21 ms   2.0 ms   10x

This is the path that matters most for our patent data extraction workloads at Amplified. Skip the DOM, skip the Element objects, just give me the text.

Wrapping up

simdxml is about 9,000 lines of Rust with zero runtime dependencies. The compiled CLI binary is 684 KB. It passes all 327 libxml2 XPath tests and the pugixml XPath test suite (the handful of discrepancies are down to ambiguities in the XPath spec, not correctness bugs).

It's available as a Rust crate, a Python package, a DuckDB extension, and a CLI (cargo install simdxml-cli). Source is on GitHub.

XPath 2.0 support is in progress (currently at 80% of the QT3 conformance suite).