pyRXP - the fastest XML parser?
ReportLab are proud to present pyRXP version
0.9, the fastest validating XML parser available
for Python, and quite possibly anywhere :-).
RXP is a very fast
validating XML parser written by Richard Tobin of the Language Technology
Group, Human Communication Research Centre, University of Edinburgh. It
complies fully with the W3C test suites (although we have compiled it without
Unicode support for the time being). We would like to thank Richard Tobin
and Henry Thompson of the Language Technology Group for making this code
available to the world.
pyRXP is a wrapper around this which constructs a lightweight in-memory "tuple tree"
in a single call. This structure is the lightest one we could define in Python,
and it is constructed entirely in C code, resulting in unprecedented speed.
It is a core part of ReportLab's forthcoming XML toolkit, which aims to
offer simple, fast and pythonic tools for common XML processing tasks.
This is not a full DOM implementation, but we think it will do what
90% of the people want, in 10% of the time. And with validation. Enjoy!
License
pyRXP, like the underlying RXP parser, is available under the GNU General Public License.
If you wish to use it in closed-source commercial products, you need
to obtain a separate license from us and also from the University of Edinburgh;
email us for more information.
Performance
The stats below are taken from our benchmark script which is in the examples file.
The tests were done on a Pentium 1000 with 128MB RAM, Windows 2000 and Python 2.1.1.
We use a 440kb XML file with many attributes as well as text content.
We'd welcome review and enhancements to the benchmarks, as we are not expert in the
other parsers and wish to be open and fair in our benchmarking.
RXP's home page is http://www.cogsci.ed.ac.uk/~richard/rxp.html.
A browsable svn tree for pyRXP is also available.
To get the source, use the command
svn co http://www.reportlab.co.uk/svn/public/reportlab/trunk/rl_addons/pyRXP
The created directory pyRXP should contain a distutils script, setup.py, which should be run with the argument
install or build. If successful, a shared library, pyRXP.pyd or
pyRXP.so, should be built.
Parser      | Validates? | Init time (s) | Parse time (s) | Traverse time (s) | Memory alloc | Memory factor
pyRXP       | Yes        | 0.0253        | 0.1416         | 0.0617            | 4176kb       | 9.63
rparsexml   | No         | 0.2064        | 0.8827         | 0.0630            | 3264kb       | 7.52
pyexpat     | No         | 0.0071        | 0.3861         | 0.0871            | 5244kb       | 12.09
cdomlette   | No         | 0.0189        | 0.3113         | ????              | 4524kb       | 10.43
minidom     | No         | 0.3835        | 5.9618         | ????              | 30264kb      | 69.76
4dom        | No         | 0.8648        | 36.5051        | ????              | 81904kb      | 188.8
MSXML 3.0   | Yes        | 2.5 / 0.08    | 0.2020         | ???               | 2452kb       | 5.65
Java Xerces | Yes        | ????          | 1.081          | ????              | 7212kb       | 16.2
Definitions
Init time
The time to load any modules needed and initialize a parser
object or parse function, given that the bare python process has
started. May be relevant to CGI apps; of little importance to
long running programs.
Parse Time
Time taken to parse an in-memory
440kb string into whatever tree structure the parser normally
produces.
Traverse Time
Time for a traversal of the tree counting the tags and
attributes. May expose lazy versus up-front strategies for
implementing the tree. The first three parsers are all supposed
to be building the same tuple-tree structure, so any differences
in traverse time are likely to be random fluctuations, or may
indicate bugs in the way they build the tree :-). We still need to
add the code to traverse all the other structures.
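A traversal of this kind might look like the sketch below. It is not the benchmark script itself. It assumes each node is a 4-tuple of (tag name, attributes dict or None, list of children or None, spare slot), with text stored as plain strings among the children; the function name and the sample tree are made up for illustration.

```python
def count_tags_and_attrs(node):
    """Return (tag_count, attr_count) for one tuple-tree node.

    Assumed node layout: (tag_name, attrs_or_None, children_or_None, spare).
    Children are either nested nodes (tuples) or text strings.
    """
    _tag, attrs, children, _spare = node
    tags = 1
    n_attrs = len(attrs) if attrs else 0
    for child in children or ():
        if isinstance(child, tuple):  # text chunks are skipped
            child_tags, child_attrs = count_tags_and_attrs(child)
            tags += child_tags
            n_attrs += child_attrs
    return tags, n_attrs


# Hand-built tree for <doc id="1"><p class="x" lang="en">hi</p><br/></doc>
sample = ("doc", {"id": "1"}, [
    ("p", {"class": "x", "lang": "en"}, ["hi"], None),
    ("br", None, None, None),
], None)

print(count_tags_and_attrs(sample))
```

Because the walk only unpacks tuples and checks types, its cost should depend on the shape of the tree rather than on which parser built it.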
Memory Allocation
The extra memory allocated by parsing the XML file, on top
of that for the bare program and the XML source data as an
in-memory string.
Memory Factor
Memory as a multiple of the underlying XML file. That is, for
each kilobyte of XML, roughly how many kilobytes of memory are needed in the
tree?
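As a rough illustration of the parse time and memory factor definitions, here is a minimal sketch using only the standard library. It is not our benchmark script: the helper names and the generated sample document are hypothetical, minidom stands in for whichever parser is being measured, and tracemalloc only sees memory allocated through Python's allocators, so the figures are not directly comparable with the table. Init time is left out because it needs a fresh process to measure fairly.

```python
import time
import tracemalloc
import xml.dom.minidom


def time_parse(xml_text, parse):
    """Wall-clock seconds for one parse of an in-memory string."""
    start = time.perf_counter()
    parse(xml_text)
    return time.perf_counter() - start


def memory_factor(xml_text, parse):
    """Return (allocated_bytes, allocated_bytes / source_bytes)."""
    source_bytes = len(xml_text.encode("utf-8"))
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    tree = parse(xml_text)  # hold a reference so the tree is still alive
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del tree
    allocated = after - before
    return allocated, allocated / source_bytes


# Hypothetical test document, not the 440kb file used in the table.
sample_xml = "<root>" + "<item id='1'>text</item>" * 2000 + "</root>"

seconds = time_parse(sample_xml, xml.dom.minidom.parseString)
allocated, factor = memory_factor(sample_xml, xml.dom.minidom.parseString)
print(f"parse {seconds:.4f}s, {allocated // 1024}kb, factor {factor:.2f}")
```

Timing and memory tracing are done in separate runs because tracing slows the parse noticeably.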
Notes on the tests
pyRXP
This is doing a validating parse in one API call. Turning
off validation makes little or no difference to the
speed.
expat
We wrote a tiny wrapper around Expat (which is in Python
2.0 and higher) to generate the tuple tree. This is included in
the examples module. This is midway between pyRXP and rparsexml in
speed; the limiting factor (as with all SAX-like parsers) is that
it calls back into Python code for every tag start, content chunk
and tag end. If Expat were extended to make the tree directly we
would expect speeds comparable to pyRXP.
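A wrapper along these lines can be sketched with the standard xml.parsers.expat module. This is an illustration rather than the code shipped in the examples module: the function name is made up, and the node layout (tag name, attributes dict or None, children list or None, spare slot) is an assumption about the tuple tree.

```python
import xml.parsers.expat


def parse_to_tuple_tree(xml_text):
    """Build a (tag, attrs, children, spare) tuple tree via Expat callbacks."""
    # Each stack entry is a mutable [tag, attrs, children] record; the
    # sentinel at the bottom collects the finished root element.
    stack = [[None, None, []]]

    def start(name, attrs):
        stack.append([name, attrs or None, []])

    def end(name):
        tag, attrs, children = stack.pop()
        stack[-1][2].append((tag, attrs, children or None, None))

    def chars(data):
        children = stack[-1][2]
        # Expat may deliver one text run in several chunks; merge them.
        if children and isinstance(children[-1], str):
            children[-1] += data
        else:
            children.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return stack[0][2][0]


tree = parse_to_tuple_tree('<a x="1">hello<b/>world</a>')
print(tree)
```

The three handlers are the Python callbacks mentioned above: each one runs once per tag start, content chunk or tag end, which is where the time goes compared with building the tree entirely in C.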
rparsexml
This is Aaron Watters' parser using string.find :-).
It goes a tad faster than Greg Stein's qp_xml, since it outputs
tuples rather than class instances, and ignores some XML features
such as external entities. It was written for Python 1.5.2 and could
probably be revved up a little using string methods. We think it's
about as fast as you can go in Python.
cdomlette
This is FourThought's tree-builder in C. It has a similar
philosophy to what we are doing - it constructs a tree of node
objects at high speed, though not quite as fast as pyRXP and
without validation. Memory efficiency is slightly worse, but it
may be holding more information than our own model.
minidom and 4DOM
minidom is in the standard Python distribution; 4DOM is
FourThought's fully compliant DOM implementation. While both of
these are correct and very useful pieces of software, they cannot
come close to a C-based parser in speed or efficiency.
MS XML 3.0
This was accessed as a COM server from Python. Startup
time was about 2.5 seconds from cold, but 0.08 seconds on
subsequent runs; MS must do something clever at the OS level. MS
has the most compact representation in memory, which is impressive
considering that it is a full DOM implementation with links back
to the parent nodes.
Java
JBuilder 6.1 with JDK 1.3.1 and the standard Xerces DOM
parser that comes with it. We know JVM settings make a difference
but don't know enough about Java; we'd welcome advice on the best
that can be done on the platform and how to set things up
accordingly.