pyRXP - the fastest XML parser?

ReportLab are proud to present pyRXP version 0.9, the fastest validating XML parser available for Python, and quite possibly anywhere :-).

RXP is a very fast validating XML parser written by Richard Tobin of the Language Technology Group, Human Communication Research Centre, University of Edinburgh. It complies fully with the W3C test suites (although we have compiled it without Unicode support for the time being). We would like to thank Richard Tobin and Henry Thompson of the Language Technology Group for making this code available to the world.

pyRXP is a wrapper around this which constructs a lightweight in-memory "tuple tree" in a single call. This structure is the lightest one we could define in Python, and it is constructed entirely in C code, resulting in unprecedented speed. It is a core part of ReportLab's forthcoming XML toolkit, which aims to offer simple, fast and pythonic tools for common XML processing tasks.

This is not a full DOM implementation, but we think it will do what 90% of the people want, in 10% of the time. And with validation. Enjoy!

License

pyRXP, like the underlying RXP parser, is available under the GNU General Public License. If you wish to use it in closed-source commercial products, you need to obtain a separate license from us and also from the University of Edinburgh; email us for more information.

Performance

The stats below are taken from our benchmark script, which is included with the examples. The tests were done on a Pentium 1000 with 128 MB RAM, Windows 2000 and Python 2.1.1. We use a 440 KB XML file with many attributes as well as text content. We'd welcome review of and enhancements to the benchmarks, as we are not experts in the other parsers and wish to be open and fair in our benchmarking.

RXP's home page is http://www.cogsci.ed.ac.uk/~richard/rxp.html.

browsable svn tree for pyRXP

To get the source, use the command

svn co http://www.reportlab.co.uk/svn/public/reportlab/trunk/rl_addons/pyRXP

The created directory pyRXP should contain a distutils script, setup.py, which should be run with the argument install or build. If successful, a shared library pyRXP.pyd or pyRXP.so should be built.

Parser       Validates?  Init time (s)  Parse time (s)  Traverse time (s)  Memory Alloc (KB)  Memory Factor
pyRXP        Yes         0.0253         0.1416          0.0617             4176               9.63
rparsexml    No          0.2064         0.8827          0.0630             3264               7.52
pyexpat      No          0.0071         0.3861          0.0871             5244               12.09
cdomlette    No          0.0189         0.3113          ????               4524               10.43
minidom      No          0.3835         5.9618          ????               30264              69.76
4dom         No          0.8648         36.5051         ????               81904              188.8
MSXML 3.0    Yes         2.5 / 0.08     0.2020          ???                2452               5.65
Java Xerces  Yes         ????           1.081           ????               7212               16.2

Definitions

Init time

The time to load any modules needed and initialize a parser object or parse function, given that the bare Python process has started. May be relevant to CGI apps; of little importance to long-running programs.

Parse Time

Time taken to parse an in-memory 440 KB string into whatever tree structure the parser normally produces.
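As a rough illustration of how a parse time like this can be measured, here is a minimal sketch in modern Python. It is not the benchmark script itself: the helper name time_parse, the best-of-three policy and the synthetic test document are all assumptions made for the example, and minidom is used only because it ships with Python.

```python
import time
import xml.dom.minidom


def time_parse(parse, xml_text, repeats=3):
    """Return the best wall-clock time in seconds for parse(xml_text)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        parse(xml_text)  # build whatever tree the parser normally produces
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic in-memory document (an assumption; not the 440 KB benchmark file).
xml_text = "<root>" + '<item id="1" kind="x">text</item>' * 1000 + "</root>"

elapsed = time_parse(xml.dom.minidom.parseString, xml_text)
print(f"minidom parse: {elapsed:.4f} s")
```

Any callable that takes the XML string can be passed as parse, so the same helper can compare several parsers on the same input.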

Traverse Time

Time for a traversal of the tree counting the tags and attributes. May expose lazy versus up-front strategies for implementing the tree. The first three parsers are all supposed to be building the same tuple-tree structure, so any differences in traverse time are likely to be random fluctuations, or may indicate bugs in the way they build the tree :-). We need to add the code to traverse all the other structures.
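A traversal of this kind might look like the sketch below. It assumes the four-element node layout described in the pyRXP documentation, (tagName, attributes, children, spare), where attributes is a dictionary or None and children is a list of strings and nested tuples or None. The function name and the sample tree are made up for illustration and are not taken from the benchmark script.

```python
def count_tags_and_attrs(node):
    """Return (tag_count, attr_count) for a tuple-tree node."""
    _tag, attrs, children, _spare = node
    tags = 1  # count this node itself
    attr_total = len(attrs) if attrs else 0
    for child in children or []:
        if isinstance(child, tuple):  # strings are text content, not tags
            child_tags, child_attrs = count_tags_and_attrs(child)
            tags += child_tags
            attr_total += child_attrs
    return tags, attr_total


# Hand-written sample tree (hypothetical data).
sample = (
    "a",
    {"x": "1"},
    [("b", None, ["hi"], None), ("c", {"y": "2", "z": "3"}, None, None)],
    None,
)
print(count_tags_and_attrs(sample))  # (3, 3)
```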

Memory Allocation

The extra memory allocated by parsing the XML file, on top of that for the bare program and the XML source data as an in-memory string.

Memory Factor

Memory as a multiple of the size of the underlying XML file. That is, for each kilobyte of XML, roughly how many kilobytes of memory are needed in the tree?
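The arithmetic is a single division, sketched below with the pyRXP allocation from the table. The function name is hypothetical. Note that dividing by the nominal 440 KB gives about 9.5 rather than the tabulated 9.63, which suggests (this is an inference, not something stated above) that the benchmark divided by the exact file size rather than the rounded figure.

```python
def memory_factor(allocated_kb, source_kb):
    """Memory allocated by the parse as a multiple of the XML source size."""
    return allocated_kb / source_kb


# 4176 KB is the pyRXP figure from the table; 440 KB is the nominal file size.
print(round(memory_factor(4176, 440), 2))  # 9.49
```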

Notes on the tests

pyRXP

This is doing a validating parse in one API call. Turning off validation makes little or no difference to the speed.

expat

We wrote a tiny wrapper around Expat (which is in Python 2.0 and higher) to generate the tuple tree. This is included in the examples. It is midway between pyRXP and rparsexml in speed; the limiting factor (as with all SAX-like parsers) is that it calls back into Python code for every tag start, content chunk and tag end. If Expat were extended to make the tree directly, we would expect speeds comparable to pyRXP.
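To make the callback overhead concrete, here is a minimal sketch of such a wrapper in modern Python. It is not the wrapper shipped in the examples; the function name is hypothetical, and the (tagName, attributes, children, spare) node layout is assumed from the pyRXP documentation. Every start tag, end tag and text chunk triggers one of the three Python functions below, which is the cost described above.

```python
import xml.parsers.expat


def parse_to_tuple_tree(xml_text):
    """Build a (tag, attrs, children, spare) tree from XML text using Expat."""
    # Each stack entry is a mutable [tag, attrs, children] list; the bottom
    # entry is a dummy root that collects the document element.
    stack = [[None, None, []]]

    def start(tag, attrs):
        # Called by Expat for every opening tag.
        node = [tag, attrs or None, []]
        stack[-1][2].append(node)
        stack.append(node)

    def end(tag):
        # Called by Expat for every closing tag.
        stack.pop()

    def text(data):
        # Called for each chunk of character data; Expat may deliver one
        # run of text as several chunks.
        stack[-1][2].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(xml_text, True)

    def freeze(node):
        # Convert the mutable working lists into the final 4-tuples.
        if isinstance(node, str):
            return node
        tag, attrs, children = node
        kids = [freeze(child) for child in children] or None
        return (tag, attrs, kids, None)

    return freeze(stack[0][2][0])


tree = parse_to_tuple_tree('<a x="1"><b>hi</b><c/></a>')
print(tree)
```

In this sketch an element with no content gets None for its children, so it does not distinguish an empty tag from an open/close pair with nothing between them.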

rparsexml

This is Aaron Watters' parser using string.find :-) It goes a tad faster than Greg Stein's qp_xml, since it outputs tuples rather than class instances, and ignores some XML features such as external entities. It was written for Python 1.5.2 and could probably be revved up a little using string methods. We think it's about as fast as you can go in Python.

cdomlette

This is Fourthought's tree-builder in C. It has a similar philosophy to what we are doing: it constructs a tree of node objects at high speed, though not quite as fast as pyRXP and without validation. Memory efficiency is slightly worse, but it may be holding more information than our own model.

minidom and 4DOM

minidom is in the standard Python distribution; 4DOM is Fourthought's fully compliant DOM implementation. While both of these are correct and very useful pieces of software, they cannot come close to a C-based parser in speed or efficiency.

MS XML 3.0

This was accessed as a COM server from Python. Startup time was about 2.5 seconds from cold, but 0.08 seconds on subsequent runs; MS must do something clever at the OS level. MS has the most compact representation in memory, which is impressive considering that it is a full DOM implementation with links back to the parent nodes.

Java

JBuilder 6.1 with JDK 1.3.1 and the standard Xerces DOM parser that comes with it. We know JVM settings make a difference but don't know enough about Java; we'd welcome advice on the best that can be done on the platform and how to set things up accordingly.

Documentation

Full documentation for PyRXP.

read the docs
(PDF, 95.4 KB)

 

Binaries

PyRXP as a Windows binary distribution.

download the binary distribution

 

Source Distribution

Source, docs, examples & benchmarks.

download as a zip (373KB)
download as a tgz (348KB)