Overview
Abstract
REXML is an XML processor for the language Ruby. REXML is conformant
(passes 100% of the Oasis non-validating tests), and includes full XPath
support. It is reasonably fast, and is implemented in pure Ruby. Best of
all, it has a clean, intuitive API.
This software is distributed under the Ruby
license.
Introduction
Why REXML? There are, at the time of this writing, already two XML
parsers for Ruby. The first is a Ruby binding to a native XML parser.
This is a fast parser, using proven technology. However, it isn't
very portable. The second is a native Ruby implementation, and as useful
as it is, it has (IMO) a difficult API.
I have this problem: I dislike obfuscated APIs. There are several XML
parser APIs for Java. Most of them follow DOM or SAX, and are very
similar in philosophy to an increasing number of Java APIs. Namely,
they look like they were designed by theorists who never had to use
their own APIs. The extant XML APIs, in general, suck. They take a
markup language which was specifically designed to be very simple,
elegant, and powerful, and wrap an obnoxious, bloated, and large API
around it. I was always having to refer to the API documentation to do
even the most basic XML tree manipulations; nothing was intuitive, and
almost every operation was complex.
Then along came Electric XML.
Ah, bliss. Look at the Electric XML API. First, the library is small;
less than 500K. Next, the API is intuitive. You want to parse a
document? doc = new Document( some_file ). Create and add a
new element? element = parent.addElement( tag_name ). Write
out a subtree? element.write( writer ). Now how about DOM?
To parse some file:
  parser = new DOMParser();
  parser.parse( new InputSource( new FileInputStream( some_file ) ) )
Create a new element? First you have to know the owning document of the
to-be-created node (can anyone say "global variables, or obtuse,
multi-argument methods"?) and call
  element = doc.createElement( tag_name )
  parent.appendChild( element )
"appendChild"? Where did they get
that from? How many different methods do we have in Java in how many
different classes for adding children to parents? addElement()?
add()? put()? appendChild()?
Heaven forbid that you want to create an Element elsewhere in the code
without having access to the owning document. I'm not even going to
go into what travesty of code you have to go through to write out an XML
sub-tree in DOM.
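For comparison, here is a minimal sketch of the same three tasks (parse, add an element, write out a subtree) in REXML. It assumes the API of the currently distributed rexml library (REXML::Document.new, Element#add_element, to_s), which may differ in detail from the release this page describes; the document content is made up for illustration.

```ruby
require 'rexml/document'

# Parse a document from a string (an IO object such as an open File also works).
doc = REXML::Document.new('<parent/>')

# Create and add a new element in one call; no owning document is needed.
child = doc.root.add_element('child', 'id' => '1')
child.text = 'hello'

# Serialize just the subtree rooted at the new element.
subtree = child.to_s
puts subtree
```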
So, I use Electric XML extensively. It is small, fast, and intuitive.
I.e., the API doesn't add a bunch of work to the task of writing
software. When I started to write more software in Ruby, I needed an XML
parser. I wasn't keen on the native library binding,
"XMLParser", because I try to avoid complex library dependencies
in my software, when I can. For a long time, I used NQXML, because it
was the only other parser out there. However, the NQXML API can be even
more painful than the Java DOM API. Almost all element operations
require some indirect node access... you had to do something
like element.node.attr['key'], and it is never obvious to me
when you access the element directly, or the node... or, really, why
they're two different objects, anyway. This is even more unfortunate
since Ruby is so elegant and intuitive, and bad APIs really stand out.
I'm not, by the way, trying to insult NQXML; I just don't like
the API.
I wrote the people at TheMind (Electric XML... get it?) and asked
them if I could do a translation to Ruby. They said yes. After a few
weeks of hacking on it for a couple of hours each week, and after having
gone down a few blind alleys in the translation, I had a working beta.
I.e., it parsed, but hadn't gone through a lot of strenuous testing.
Along the way, I had made a few changes to the API, and a lot of changes
to the code. First off, Ruby does iterators differently than Java. Java
uses a lot of helper classes. Helper classes are exactly the kinds of
things that theorists come up with... they look good on paper, but using
them is like chewing glass. You find that you spend 50% of your time
writing helper classes just to support the other 50% of the code that
actually does the job you were trying to solve in the first place. In
this case, the Java helper classes are either Enumerations or Iterators.
Ruby, on the other hand, uses blocks, which is much more elegant. Rather
than:
  for (Enumeration e=parent.getChildren(); e.hasMoreElements(); ) {
    Element child = (Element)e.nextElement();
    // Do something with child
  }
you get:
  parent.each_child { |child|
    # Do something with child
  }
Can't you feel the peace and contentment in this block of code?
Ruby is the language Buddha would have programmed in.
Anyhoo, I chose to use blocks in REXML directly, since this is more
common to Ruby code than for x in y ... end, which is as
orthogonal to the original Java as possible.
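A runnable version of the block style, as a small sketch. It assumes the current rexml library and uses each_element, which yields child elements only; the sample document is invented for the example.

```ruby
require 'rexml/document'

doc = REXML::Document.new('<parent><a/><b/><c/></parent>')

names = []
# each_element yields every child element to the block, in document order.
doc.root.each_element { |child| names << child.name }

p names
```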
Also, I changed the naming conventions to more Ruby-esque method
names. For example, the Java method getAttributeValue()
becomes in Ruby get_attribute_value(). This is a toss-up. I
actually like the Java naming convention more[1],
but the latter is more common in Ruby code, and I'm trying to make
things easy for Ruby programmers, not Java programmers.
The biggest change was in the code. The Java version of Electric XML
did a lot of efficient String-array parsing, character by character.
Ruby, however, has ubiquitous, efficient, and powerful regular
expression support. All regex functions are done in native code, so it
is very fast, and the power of Ruby regex rivals that of Perl.
Therefore, a direct conversion of the Java code to Ruby would have been
more difficult, and much slower, than using Ruby regexps. I therefore
used regexes. In doing so, I cut the number of lines of source code by
half.
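To give a feel for why regexes shorten this kind of code, here is a rough sketch of pulling a start tag apart with two patterns. The patterns and the parse_start_tag helper are hypothetical illustrations written for this page; they are not REXML's actual expressions and they ignore much of the XML grammar.

```ruby
# Hypothetical patterns: a start tag with optional quoted attributes.
START_TAG = /\A<([A-Za-z_:][\w:.\-]*)((?:\s+[\w:.\-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(\/?)>/
ATTRIBUTE = /([\w:.\-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Returns a hash describing the tag at the start of +source+, or nil
# if +source+ does not begin with a start tag.
def parse_start_tag(source)
  match = START_TAG.match(source)
  return nil unless match

  attributes = {}
  # scan yields one [name, double_quoted, single_quoted] triple per attribute.
  match[2].scan(ATTRIBUTE) { |name, dq, sq| attributes[name] = dq || sq }

  { name: match[1], attributes: attributes, empty: match[3] == '/' }
end

tag = parse_start_tag(%q(<item id="42" lang='en'/>))
p tag
```

A character-by-character parser needs explicit state for each of these pieces; here the regex engine carries that state.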
Finally, by this point the API looks almost nothing like the original
Electric XML API, and practically none of the code is even vaguely
similar. However, even though the actual code is completely different, I
did borrow the same process of processing XML as Electric, and am deeply
indebted to the Electric XML code for inspiration.
One last thing. If you use and like this software, and you feel
compelled to make some contribution to the author by way of saying
"thanks", and you happen to know what a tea cozy is and where to
get them, then you can send me one. Send those puppies to:
  Sean Russell
  60252 Rimfire Rd.
  Bend, OR 97702
  USA
If you're outside of the US, make sure you write "gift"
on it to avoid the taxes. If you don't want to send a tea cozy, you
can also send money. Or don't send anything. Offer me a job I
can't refuse, in Western Europe somewhere.
Features
- Simple API
- Both stream (SAX-like) and tree (DOM-like) parsing[2]
- Small
- Reasonably fast
- Native Ruby
- Full XPath support
- Fully XML 1.0 conformant
- ISO-8859-1, UNILE, UTF-16 and UTF-8 input and output
- Documentation
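The two parsing styles in the feature list can be sketched side by side. This assumes the current rexml library (REXML::Document.parse_stream and the REXML::StreamListener mixin); the ItemCollector class and the sample inventory document are made up for the example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

SAMPLE_XML = '<inventory><item name="tea cozy"/><item name="kettle"/></inventory>'

# Tree (DOM-like): build the whole document, then query it.
doc = REXML::Document.new(SAMPLE_XML)
tree_names = doc.root.elements.to_a('item').map { |item| item.attributes['name'] }

# Stream (SAX-like): receive events as the parser encounters them.
class ItemCollector
  include REXML::StreamListener

  attr_reader :names

  def initialize
    @names = []
  end

  # Called once per start tag; attributes arrive as a name => value hash.
  def tag_start(name, attributes)
    @names << attributes['name'] if name == 'item'
  end
end

collector = ItemCollector.new
REXML::Document.parse_stream(SAMPLE_XML, collector)

p tree_names
p collector.names
```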
Operation
Installation
Run ruby bin/install.rb. By the way, you really should
look at these sorts of files before you run them as root. They could
contain anything, and since (in Ruby, at least) they tend to be
mercifully short, it doesn't hurt to glance over them. If you want
to uninstall REXML, run ruby bin/install.rb -u.
Unit tests
If you have runit or Test::Unit installed (with the runit API), you
can run the unit test cases. You can run both installed and not
installed tests; to run the tests before installing REXML, run
ruby -I. bin/suite.rb. To run them with an installed REXML,
use ruby bin/suite.rb.
Benchmarks
There is a benchmark suite in benchmarks/. To run the
benchmarks, change into that directory and run ruby comparison.rb.
If you have nothing else installed, only the benchmarks for REXML will
be run. However, if you have any of the following installed, benchmarks
for those tools will also be run:
- NQXML
- XMLParser
- Electric XML (you must copy EXML.jar into the
benchmarks directory and compile flatbench.java
before running the test)
The results will be written to index.html.
General Usage
Please see the Tutorial.
The API documentation is available
on-line,
or it can be downloaded as an archive
in tbz2 format (~40Kb),
in tgz format (~70Kb),
or (if you're a masochist) in zip format (~280Kb).
The best solution is to download and install Dave Thomas' most excellent
rdoc and generate the API docs
yourself; then you'll be sure to have the latest API docs and
won't have to keep downloading the doc archive.
The unit tests in test/ and the benchmarking code in
benchmark/ provide
additional examples of using REXML. The Tutorial provides examples with
commentary. The documentation unpacks into rexml/doc.
Status
Speed and Completeness
Unfortunately, NQXML is the only package REXML can be compared
against; XMLParser uses expat, which is a native library, and really is
a different beast altogether. So in comparing NQXML and REXML you can
look at four things: speed, size, completeness, and API.
Benchmarks
REXML is faster than NQXML in some things, and slower than NQXML in a
couple of things. You can see this for yourself by running the supplied
benchmarks. Most of the places where REXML is slower are because of the
convenience methods[3]. On the
positive side, most of the convenience methods can be bypassed if you
know what you are doing. Check the
benchmark comparison page for a general comparison. You
can look at the benchmark code yourself to decide how much salt to take
with them.
The sizes of the XML parsers are close[4]. NQXML 1.1.3 has 1580 non-blank, non-comment lines of code;
REXML 2.0 has 2340[5].
REXML is a mostly conformant XML 1.0 parser. It supports multiple
language encodings, and internal processing uses the required UTF-8 and
UTF-16 encodings. It passes 100% of the Oasis non-validating tests.
Furthermore, it provides a full implementation of XPath.
The last thing is the API, and this is where I think REXML wins. The
core API is clean and intuitive, and things work the way you would
expect them to. Convenience methods abound, and you can code for either
convenience or speed. REXML code is terse, and readable, like Ruby code
should be. The best way to decide which you like more is to write a
couple of small applications in each, then use the one you're more
comfortable with.
XPath
As of release 2.0, XPath 1.0 is fully implemented.
I fully expect bugs to crop up from time to time, so if you see any
bogus XPath results, please let me know. That said, since I'm now
following the XPath grammar and spec fairly closely, I suspect that you
won't be surprised by REXML's XPath very often, and it should
become rock solid fairly quickly.
Check the "bugs" section for known problems; there are little
bits of XPath here and there that are not yet implemented, but I'll
get to them soon.
Namespace support is rather odd, but it isn't my fault. I can
only do so much and still conform to the specs. In particular, XPath
attempts to help as much as possible. Therefore, in the trivial cases,
you can pass namespace prefixes to Element.elements[...] and so on -- in
these cases, XPath will use the namespace environment of the base
element you're starting your XPath search from. However, if you want
to do something more complex, like pass in your own namespace
environment, you have to use the XPath first(), each(), and match()
methods. Also, default namespaces force you to use the XPath
methods, rather than the convenience methods, because there is no way
for XPath to know what the mappings for the default namespaces should
be. This is exactly why I loathe namespaces -- a pox on the person(s) who
thought them up!
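As a sketch of the default-namespace case, the example below passes an explicit namespace mapping to the XPath methods. It assumes the current rexml library, where REXML::XPath.first and REXML::XPath.match take the mapping as their third argument; the prefix x and the namespace URI are arbitrary choices for the example.

```ruby
require 'rexml/document'

xml = '<root xmlns="http://example.com/ns"><item>one</item><item>two</item></root>'
doc = REXML::Document.new(xml)

# The document uses a default namespace, so there is no prefix to borrow
# from it. We supply our own prefix-to-URI mapping instead.
namespaces = { 'x' => 'http://example.com/ns' }

first_item = REXML::XPath.first(doc, '/x:root/x:item', namespaces)
all_items  = REXML::XPath.match(doc, '//x:item', namespaces).map(&:text)

puts first_item.text
p all_items
```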
Namespaces
Namespace support is now fairly stable. One thing to be aware of is
that REXML is not (yet) a validating parser. This means that some
invalid namespace declarations are not caught.
Mailing list
There is a low-volume mailing list dedicated to REXML. To subscribe,
send an empty email to ser-rexml-subscribe@germane-software.com.
This list is more or less spam proof. To unsubscribe, similarly send a
message to ser-rexml-unsubscribe@germane-software.com.
RSS
An RSS file
for REXML is now being generated from the change log. This allows you to
be alerted of upgrades via 'pull' as they become available, if
you have an RSS browser. This is an abuse of the RSS mechanism, which
was intended to be a distribution system for headlines linked back to
full articles, but it works. The headline for REXML is the version
number, and the description is the change log. The links all link back
to the REXML home page. The URL for the RSS itself is
http://www.germane-software.com/software/rexml/rss.xml
Applications that use REXML
- Hiroshi NAKAMURA's SOAP4R
package can use REXML as the XML processor.
- Chris Morris' XML
Serializer. XML Serializer provides a serialization mechanism
for Ruby that provides a bidirectional mapping between Ruby classes
and XML documents.
- Much of the RubyXML
site is generated with scripts that use REXML. RubyXML is a great
place to find information about the intersection between Ruby and XML.
- Jelly is a generic utility for generating Ruby libs (XML writers) from W3C
XML schemas.
Change Log
- 2.2.1: Fixed a bug in benchmark/bench.rb that kept it
from running. Added standalone?() to XMLDecl as an alias for the
standalone accessor. Improvements to the streaming API; in
particular, pulling data from non-closing streams doesn't require
passing a block size of 1 to the IOSource class any longer; in
fact, the block size is ignored. Added a user-supplied patch
to fix the fact that not all of the DTD events were getting
passed to the listener.
- 2.1.3: Fixed broken links in documentation. Added new
documentation layout; the old format -- everything on one page -- was
getting a bit overwhelming. Added RSS for changelog. Bugfix for element
cloning namespace loss. The Streaming API wasn't normalizing input
strings; this has been fixed. Added support for deep cloning via
Parent.deep_clone(). Fixed some streaming issues for SOAP4RUBY. In
particular, text normalization is now also done for the Streaming API.
'\r' handling is now correct, as per the XML spec, and entities are
handled better. The carriage-return numeric entity is now converted to '\r' internally, and then
translated back to '\r' on output. All other numeric entities
(&#nnn; and &#xnnn;) are now converted to unicode on input,
but are only converted back to entities if they don't fit in the
requested encoding.
- 2.1.2: Fixed a bug with reading ISO-8859-1 encoded
documents, and Document now includes Output, which it always should
have.
- 2.1.1: Forgot to add output.rb to the repository.
- 2.1.0: IO optimizations, and support for ISO-8859-1
output. Fixed up pretty-printing a little. Now, if pretty-printing is
turned on, text nodes are stripped before printing. This, obviously, can
mess up what you'd expect from :respect_whitespace, but pretty
printing, by definition, must change your formatting. Updated the
tutorial a bit. Please see the section on adding text for a warning, if
you're using a non-UTF-8 compatible encoding. Changed behavior of
Element.attributes.each. It now iterates over key, value
pairs, rather than attributes. This was a feature request. Expanded the
unit tests and subsequently fixed a number of obscure bugs. I'm
distributing the API documentation separately from the main distribution
now, because the API docs constitute nearly 50% of the total
distribution size. Fixed a bug in namespace handling in attributes.
Completely updated the API documentation for Element, Element.Elements,
and Element.Attributes; the rest of the classes to follow. I'm
seriously contemplating removing the examples from the API
documentation, because most of them are practically duplicates of the
unit tests in test/.
- 2.0.4: 2.0 munged the encoding value in output. This is
fixed. I left debugging turned on in XPath in 2.0.2 :-/
- 2.0.2: Added grouping '(...)' and preceding:: and
following:: axis. This means that, aside from functional bugs, XPath
should have no missing functionality bugs. Keep in mind that not all
Functions are tested, though.
- 2.0.1: Added some unit tests, and fixed a namespace XPath
bug WRT attribute default NS's. Unicode support was screwing up the
upper end of ASCII support; chars between 0xF0 and 0xFD were getting
munged. This has been fixed, at the cost of a small amount of speed.
Optimized the descendant axes of XPath; it should be significantly
faster for '//' and other descendant operations. Added several
user contributed unit tests. Re-added QuickPath, the old,
non-fully-XPath compliant, yet much faster, XPath processor. Everything
is being converted to UTF8 now, and the XML declaration reflects this.
See the bugs for more information.
- 2.0: True XPath support. Finally. XPath is fully
implemented now, and passes all of the tests I can throw at it,
including complex XPaths such as '*[* and not(*/node()) and
not(*[not(@style)]) and not(*/@style != */@style)]'. It may
be slower than it was, but it should be reasonably efficient for what it
is doing. The XPath spec doesn't help, and thwarts most attempts at
optimization. Please see the notes on XPath for more information. Oh,
and some minor bugs were fixed in the XML parser.
- 1.2.8: Fixed a bug pointed out by Peter Verhage where the
element names weren't being properly parsed if a namespace was
involved.
- 1.2.7: Fixing problems with the 1.2.6 distribution :-/.
Added an "applications using REXML" section in this document --
send me those links! Added rdoc documentation. I'm not using API2XML
anymore. I think API2XML was the right model, generating XML rather than
HTML (which is what rdoc does), but rdoc does a much better job at
parsing Ruby source, and I really didn't want to go there in the
first place. Also, I had forgotten to generate the Tutorial HTML.
- 1.2.6: Documentation fix (TR). Fixed a bug in Element.add
(and, therefore, Element.add_element). Added Robert Feldt's terse
xml constructor to contrib/ (check it out; it's handy). Tobias
discovered a terrible bug, whereby ENTITY wasn't printing out a
final '>'. After a long discussion with a couple of users,
and some review of the XML spec, I decided to reverse the default
handling of whitespace and pretty printing. REXML now no longer defaults
to pretty printing, and preserves whitespace unless otherwise directed.
Added provisional namespace support to XPath. XPath is going to require
another rewrite.
- 1.2.5: Bug fixes: doctypes that had spaces between the
closing ] and > generated errors. There was a small bug that caused
too many newlines to be generated in some output. Eelis van der Weegen
(what a great name!) pointed out one of the numerous API errors. Julian
requested that add_attributes take both Hash (original) and array of
arrays (as produced by StreamListener). I killed the mailing list,
accidentally, and fixed it again. Fixed a bug in next_sibling, caused by
a combination of mixing overriding <=>() and using
Array.index().
- 1.2.4: Changes since 1.1b: 100% OASIS valid tests passed.
UTF-8/16 support. Many bug fixes. to_a() added to Parent and
Element.elements. Updated tutorial. Added variable IOSource buffer size,
for stream parsing. delete() now fails silently rather than throwing an
exception if it can't find the element to delete. Added a patch to
support REXMLBuilder. Reorganized file layout in distribution; added a
repackaging program; added the logo.
- 1.1b: Changes since 1.1a: Stream parsing added. Bug fixes
in entity parsing. New XPath implementation, fixing many bugs and making
feature complete. Completed whitespace handling, adding much
functionality and fixing several bugs. Added convenience methods for
inserting elements. Improved error reporting. Fixed attribute content
to correctly handle quotes and apostrophes. Added mechanisms for
handling raw text. Cleaned up utility programs (profile.rb,
comparison.rb, etc.). Improved speed a little. Brought REXML up to 98.9%
OASIS valid source compliance.
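Two of the API changes recorded in this log, key/value iteration over attributes (2.1.0) and deep cloning (2.1.3), can be sketched as follows. This assumes the current rexml library, where these are spelled attributes.each and deep_clone; the sample document is invented.

```ruby
require 'rexml/document'

doc = REXML::Document.new('<book id="7" lang="en"><title>Ruby</title></book>')

# attributes.each yields the attribute name and its string value.
pairs = {}
doc.root.attributes.each { |name, value| pairs[name] = value }

# deep_clone copies the element together with all of its descendants,
# so changing the copy leaves the original tree untouched.
copy = doc.root.deep_clone
copy.elements['title'].text = 'Changed'

p pairs
puts doc.root.elements['title'].text
puts copy.elements['title'].text
```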
Known Bugs
Please send me bug reports. If you really want your bug fixed fast,
include an runit or Test::Unit method (or methods) that illustrates the
problem. You don't have to send me an entire suite; all bug
submissions go into test/contrib_test.rb. If you don't send me a
unit test, I'll have to write one myself, which will mean that your
bug will take longer to fix.
When submitting bug reports, please include the version of Ruby and
of REXML that you're using, and the operating system you're
running on. Just run:
  ruby -vrrexml/rexml -e 'p REXML::Version,PLATFORM'
and paste the results in your bug report.
- Some of the XPath functions are untested. Any XPath
functions that don't work are also bugs... please report them to me.
Let me know if you find any bugs, and I'll fix them eventually. If
you send me a unit test that illustrates the problem, I'll try to
fix the problem within a couple of days (if I can) and send you a patch,
personally.
- Sometimes the test suite hangs or segfaults the Ruby interpreter.
If this is something that I can fix, then it is a bug, and I will fix
it.
To Do
- Entity in doctype handling
- Change the stream listener API, (1) to look more like SAX, and
(2) to allow filtering.
- True DTD handling
- Bug report submission mechanism
- Cause XPath to break when it finds the first real XPath match when
the user has called first(). This will speed up XPath in some cases, but
not all.
- RFC/RCR on the REXML page
- Add a default listener that constructs trees based on an event
map. NQXML does something like this:
  nd = NQXML::Dispatcher.new(file)
  nd.handle(:start_element, %w(root level1 level2)) { |e|
    # do something with e
  }
  nd.handle(:text, %w(root level1 level2 level3)) { |e|
    # reads text inside <level3> tag
  }
  nd.start()
I'd want it to look similar; basically, the user passes
a set of tags to the parser, which instructs the Stream Listener to build
sub-trees for them. When a sub-tree is finished being built, some event is
triggered.
- Allow the user to add entity conversions
- I'd like to hack XMLRPC4R to use REXML, for my own purposes.
- Deep clone for elements
- Change Element.attributes.each to iterate over
key,Attribute.value pairs, rather than key,Attribute pairs.
- ISO-8859-1 output support
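The event-map listener item in the list above can be roughed out on top of the existing stream API. What follows is only a sketch of the idea, not a planned or existing REXML class: PathDispatcher and its handle method are hypothetical names, and it dispatches on the current element path without building sub-trees. It assumes the current rexml library's StreamListener callbacks (tag_start, text, tag_end).

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: runs a handler when an event occurs at a given path.
class PathDispatcher
  include REXML::StreamListener

  def initialize
    @path = []      # names of the currently open elements, outermost first
    @handlers = {}  # [event, path] => block
  end

  def handle(event, path, &block)
    @handlers[[event, path]] = block
  end

  def tag_start(name, attributes)
    @path.push(name)
    fire(:start_element, name, attributes)
  end

  def text(content)
    fire(:text, content)
  end

  def tag_end(_name)
    @path.pop
  end

  private

  # Arrays compare by content, so the live path works as a lookup key.
  def fire(event, *args)
    handler = @handlers[[event, @path]]
    handler&.call(*args)
  end
end

seen = []
dispatcher = PathDispatcher.new
dispatcher.handle(:start_element, %w(root level1 level2)) { |name, attrs| seen << [name, attrs['id']] }
dispatcher.handle(:text, %w(root level1 level2 level3)) { |text| seen << text }

xml = '<root><level1><level2 id="a"><level3>hi</level3></level2></level1></root>'
REXML::Document.parse_stream(xml, dispatcher)
p seen
```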
Requested features
- Process entity declarations in DocType.
FAQ
Hey! REXML is untainting my strings!
Yes, it is, but not intentionally. REXML relies on String.unpack() and
Array.pack() to do encoding conversions, and in this process, tainting is
lost. If you have a really good reason why REXML should preserve
this attribute, at the cost of some speed, let me know and I'll
consider your argument.
Why is Element.elements indexed off of '1' instead of
'0'?
Because of XPath. The XPath specification states that the index of the
first child node is '1'. Although it may be counter-intuitive to
base elements on 1, it is more undesirable to have element.elements[0] ==
element.elements[ 'node()[1]' ]. Since I can't change the
XPath specification, the result is that Element.elements[1] is the first
child element.
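A short sketch of the indexing rule, assuming the current rexml library. It shows that the integer index and the XPath position predicate agree, and that converting to a plain Ruby array gives back zero-based indexing.

```ruby
require 'rexml/document'

doc = REXML::Document.new('<list><item>first</item><item>second</item></list>')
elements = doc.root.elements

by_index = elements[1]          # 1-based, matching XPath
by_xpath = elements['item[1]']  # XPath position predicate
by_array = elements.to_a[0]     # ordinary Ruby array, 0-based

puts by_index.text
puts by_xpath.text
puts by_array.text
```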
Why isn't REXML a validating parser?
Because validating parsers must include code that parses and interprets
DTDs. I hate DTDs. REXML supports the barest minimum of DTD parsing, and
even that isn't complete. There is DTD parsing code in the works, but
I only work on it when I'm really, really bored. Rumor has it that a
contributor is working on a DTD parser for REXML; rest assured that any
such contribution will be included with REXML as soon as it is available.
I'm trying to create an ISO-8859-1 document, but when I add text to
the document it isn't being properly encoded.
Regardless of what the encoding of your document is, when you add text
programmatically to a REXML document you must ensure that you are
only adding UTF-8 to the tree. In particular, you can't add ISO-8859-1
encoded text that contains characters above 0x80 to REXML trees -- you
must convert it to UTF-8 before doing so. Luckily, this is easy:
text.unpack('C*').pack('U*') will do the
trick. 7-bit ASCII is identical to UTF-8, so you probably won't need
to worry about this.
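The conversion can be checked on a small string. The unpack/pack idiom works because every ISO-8859-1 byte value is also the Unicode code point of the same character. The String#encode line is an alternative available on Ruby 1.9 and later; it is shown as an assumption about your Ruby version, not as something REXML requires.

```ruby
# "caf\xE9" is "café" in ISO-8859-1: one byte per character.
latin1 = "caf\xE9".b

# The idiom from the text: read each byte as a code point, then write
# each code point back out as UTF-8.
utf8 = latin1.unpack('C*').pack('U*')

# Alternative on newer Rubies: label the bytes, then transcode.
via_encode = latin1.dup.force_encoding('ISO-8859-1').encode('UTF-8')

p utf8.bytes
p utf8.encoding
```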
Credits
I've had help from a number of resources; if I haven't listed
you here, it means that I just haven't gotten around to adding you, or
that I'm a dork and have forgotten. In either case, feel free to write
me and complain. I may ignore you, but at least you tried. (Actually, I
don't consciously ignore anybody except spammers.)
- Michael Granger supplied a patch for REXML that makes the unit
tests pass under Ruby 1.7.
- Stefan Scholl, who provided a lot of feedback and bug reports
while I was trying to get ISO-8859-1 support working.
- Steven E Lumos for volunteering information about XPath
particulars.
- Erik Terpstra heard my pleas and submitted several logos for
REXML. After sagely avoiding choosing one for several weeks, I finally
forced my poor slave of a wife to pick one (this is what we call
"delegation"). She did, with caveats; Erik quickly made the
changes, and the result is what you now see at the top of this page. He
also supplied a smaller version
that you can include with your projects that use REXML, if you'd
like.
- Bug fixes provided by Fumitoshi UKAI (CData metacharacter quoting
bug)
- Oliver M. Bolzer is maintaining a Debian package distribution of
REXML. He also has provided good feedback and bug reports about
namespace support.
- Ernest Ellingson contributed the sourcecode for turning UTF16 and
UNILE encodings into UTF8, which allowed REXML to get the 100% OASIS
valid tests rating.
- TAKAHASHI Masayoshi, for information on UTF
- James Britt contributed code that makes
Document.parse_stream easier to use by allowing it to be passed either a
Source, File, or String.
- Electric XML: This was, after all, the inspiration for REXML. Originally,
I was just going to do a straight port, and although REXML doesn't
in any way, shape or form resemble Electric XML, still the basic
framework and philosophy was inspired by E-XML. And I still use E-XML in
my Java projects.
- Tobias Reif: Numerous bug reports, and suggestions for
improvement.
- NQXML: While I may complain about the NQXML API, I wrote a few applications
using it that wouldn't have been written otherwise, and it was very
useful to me. It also encouraged me to write REXML. Never complain about
free software *slap*.
- Robert Feldt: Bug reports and suggestions/recommendations about
improving REXML. Testing is one of the most important aspects of
software development.
- Many, many other people who've submitted bug reports,
suggestions, and positive feedback. You're all co-developers!