2.2.1

Stable release (2.0.4, unix)
Stable release (2.0.4, dos)
Current release (2.2.1, unix)
Current release (2.2.1, dos)

Overview

Abstract

REXML is an XML processor for the language Ruby. REXML is conformant (passes 100% of the Oasis non-validating tests), and includes full XPath support. It is reasonably fast, and is implemented in pure Ruby. Best of all, it has a clean, intuitive API.

This software is distributed under the Ruby license.

Introduction

Why REXML? There are, at the time of this writing, already two XML parsers for Ruby. The first is a Ruby binding to a native XML parser. This is a fast parser, using proven technology. However, it isn't very portable. The second is a native Ruby implementation, and as useful as it is, it has (IMO) a difficult API.

I have this problem: I dislike obfuscated APIs. There are several XML parser APIs for Java. Most of them follow DOM or SAX, and are very similar in philosophy to an increasing number of Java APIs. Namely, they look like they were designed by theorists who never had to use their own APIs. The extant XML APIs, in general, suck. They take a markup language which was specifically designed to be very simple, elegant, and powerful, and wrap an obnoxious, bloated, and large API around it. I was always having to refer to the API documentation to do even the most basic XML tree manipulations; nothing was intuitive, and almost every operation was complex.

Then along came Electric XML.

Ah, bliss. Look at the Electric XML API. First, the library is small; less than 500K. Next, the API is intuitive. You want to parse a document? doc = new Document( some_file ). Create and add a new element? element = parent.addElement( tag_name ). Write out a subtree? element.write( writer ). Now how about DOM? To parse some file: parser = new DOMParser(); parser.parse( new InputSource( new FileInputStream( some_file ) ) ). Create a new element? First you have to know the owning document of the to-be-created node (can anyone say "global variables, or obtuse, multi-argument methods"?) and call element = doc.createElement( tag_name ); parent.appendChild( element ). "appendChild"? Where did they get that from? How many different methods do we have in Java in how many different classes for adding children to parents? addElement()? add()? put()? appendChild()? Heaven forbid that you want to create an Element elsewhere in the code without having access to the owning document. I'm not even going to go into what travesty of code you have to go through to write out an XML sub-tree in DOM.

So, I use Electric XML extensively. It is small, fast, and intuitive. I.e., the API doesn't add a bunch of work to the task of writing software. When I started to write more software in Ruby, I needed an XML parser. I wasn't keen on the native library binding, "XMLParser", because I try to avoid complex library dependencies in my software, when I can. For a long time, I used NQXML, because it was the only other parser out there. However, the NQXML API can be even more painful than the Java DOM API. Almost all element operations require some indirect node access... you had to do something like element.node.attr['key'], and it is never obvious to me when you access the element directly, or the node... or, really, why they're two different objects, anyway. This is even more unfortunate since Ruby is so elegant and intuitive, and bad APIs really stand out. I'm not, by the way, trying to insult NQXML; I just don't like the API.

I wrote the people at TheMind (Electric XML... get it?) and asked them if I could do a translation to Ruby. They said yes. After a few weeks of hacking on it for a couple of hours each week, and after having gone down a few blind alleys in the translation, I had a working beta. I.e., it parsed, but hadn't gone through a lot of strenuous testing. Along the way, I had made a few changes to the API, and a lot of changes to the code. First off, Ruby does iterators differently than Java. Java uses a lot of helper classes. Helper classes are exactly the kinds of things that theorists come up with... they look good on paper, but using them is like chewing glass. You find that you spend 50% of your time writing helper classes just to support the other 50% of the code that actually solves the problem you were trying to solve in the first place. In this case, the Java helper classes are either Enumerations or Iterators. Ruby, on the other hand, uses blocks, which is much more elegant. Rather than:

for (Enumeration e=parent.getChildren(); e.hasMoreElements(); ) {
   Element child = (Element)e.nextElement();
   // Do something with child
}

you get:

parent.each_child { |child|
   # Do something with child
}

Can't you feel the peace and contentment in this block of code? Ruby is the language Buddha would have programmed in.

Anyhoo, I chose to use blocks in REXML directly, since this is more common to Ruby code than for x in y ... end, which is as orthogonal to the original Java as possible.
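As a concrete illustration of the block style, here is a minimal sketch. It assumes the REXML that ships with current Ruby versions, and it uses each_element (which visits child elements only) in place of the each_child shown above; the sample document is made up for the example.

```ruby
require 'rexml/document'

# A made-up document, just for illustration.
doc = REXML::Document.new('<parent><a/><b/><c/></parent>')

# Collect the name of every child element of the root, using a block.
names = []
doc.root.each_element { |child| names << child.name }
```

No Enumeration or Iterator helper object appears anywhere; the block is the loop body.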

Also, I changed the naming conventions to more Ruby-esque method names. For example, the Java method getAttributeValue() becomes in Ruby get_attribute_value(). This is a toss-up. I actually like the Java naming convention more (see note 1), but the latter is more common in Ruby code, and I'm trying to make things easy for Ruby programmers, not Java programmers.

The biggest change was in the code. The Java version of Electric XML did a lot of efficient String-array parsing, character by character. Ruby, however, has ubiquitous, efficient, and powerful regular expression support. All regex functions are done in native code, so it is very fast, and the power of Ruby regex rivals that of Perl. Therefore, a direct conversion of the Java code to Ruby would have been more difficult, and much slower, than using Ruby regexps. I therefore used regexes. In doing so, I cut the number of lines of source code by half.
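To give a feel for what regex-driven parsing looks like, here is a rough sketch. It is not taken from REXML's source: the START_TAG and ATTRIBUTE patterns and the parse_start_tag helper are hypothetical, and they only handle a simplified subset of XML start tags (no entities, no validation of names).

```ruby
# Hypothetical patterns, for illustration only; not REXML's actual parser.
START_TAG = /\A<([A-Za-z_:][\w:.\-]*)((?:\s+[\w:.\-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(\/)?>/
ATTRIBUTE = /([\w:.\-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Returns a hash describing the start tag at the beginning of source,
# or nil if source does not begin with one.
def parse_start_tag(source)
  match = START_TAG.match(source)
  return nil unless match

  attributes = {}
  # Each scan result is [name, double_quoted_value, single_quoted_value].
  match[2].scan(ATTRIBUTE) { |name, dq, sq| attributes[name] = dq || sq }

  { name: match[1], attributes: attributes, empty: !match[3].nil? }
end

tag = parse_start_tag(%q(<item id="1" kind='x'/>))
```

One match and one scan replace what would otherwise be a hand-written character loop with its own state tracking, which is where the savings in lines of code come from.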

Finally, by this point the API looks almost nothing like the original Electric XML API, and practically none of the code is even vaguely similar. However, even though the actual code is completely different, I did borrow the same process of processing XML as Electric, and am deeply indebted to the Electric XML code for inspiration.

One last thing. If you use and like this software, and you feel compelled to make some contribution to the author by way of saying "thanks", and you happen to know what a tea cozy is and where to get them, then you can send me one. Send those puppies to: Sean Russell 60252 Rimfire Rd. Bend, OR 97702 USA If you're outside of the US, make sure you write "gift" on it to avoid the taxes. If you don't want to send a tea cozy, you can also send money. Or don't send anything. Offer me a job I can't refuse, in Western Europe somewhere.

Features

Operation

Installation

Run ruby bin/install.rb. By the way, you really should look at these sorts of files before you run them as root. They could contain anything, and since (in Ruby, at least) they tend to be mercifully short, it doesn't hurt to glance over them. If you want to uninstall REXML, run ruby bin/install.rb -u.

Unit tests

If you have runit or Test::Unit installed (with the runit API), you can run the unit test cases. You can run both installed and not installed tests; to run the tests before installing REXML, run ruby -I. bin/suite.rb. To run them with an installed REXML, use ruby bin/suite.rb.

Benchmarks

There is a benchmark suite in benchmarks/. To run the benchmarks, change into that directory and run ruby comparison.rb. If you have nothing else installed, only the benchmarks for REXML will be run. However, if you have any of the following installed, benchmarks for those tools will also be run:

The results will be written to index.html.

General Usage

Please see the Tutorial.

The API documentation is available on-line, or it can be downloaded as an archive in tbz2 format (~40Kb), in tgz format (~70Kb), or (if you're a masochist) in zip format (~280Kb). The best solution is to download and install Dave Thomas' most excellent rdoc and generate the API docs yourself; then you'll be sure to have the latest API docs and won't have to keep downloading the doc archive.

The unit tests in test/ and the benchmarking code in benchmark/ provide additional examples of using REXML. The Tutorial provides examples with commentary. The documentation unpacks into rexml/doc.

Status

Speed and Completeness

Unfortunately, NQXML is the only package REXML can be compared against; XMLParser uses expat, which is a native library, and really is a different beast altogether. So in comparing NQXML and REXML you can look at four things: speed, size, completeness, and API.

Benchmarks

REXML is faster than NQXML in some things, and slower than NQXML in a couple of things. You can see this for yourself by running the supplied benchmarks. Most of the places where REXML is slower are because of the convenience methods (see note 3). On the positive side, most of the convenience methods can be bypassed if you know what you are doing. Check the benchmark comparison page for a general comparison. You can look at the benchmark code yourself to decide how much salt to take with them.

The sizes of the XML parsers are close (see note 4). NQXML 1.1.3 has 1580 non-blank, non-comment lines of code; REXML 2.0 has 2340 (see note 5).

REXML is a mostly conformant XML 1.0 parser. It supports multiple language encodings, and internal processing uses the required UTF-8 and UTF-16 encodings. It passes 100% of the Oasis non-validating tests. Furthermore, it provides a full implementation of XPath.

The last thing is the API, and this is where I think REXML wins. The core API is clean and intuitive, and things work the way you would expect them to. Convenience methods abound, and you can code for either convenience or speed. REXML code is terse, and readable, like Ruby code should be. The best way to decide which you like more is to write a couple of small applications in each, then use the one you're more comfortable with.

XPath

As of release 2.0, XPath 1.0 is fully implemented.

I fully expect bugs to crop up from time to time, so if you see any bogus XPath results, please let me know. That said, since I'm now following the XPath grammar and spec fairly closely, I suspect that you won't be surprised by REXML's XPath very often, and it should become rock solid fairly quickly.

Check the "bugs" section for known problems; there are little bits of XPath here and there that are not yet implemented, but I'll get to them soon.

Namespace support is rather odd, but it isn't my fault. I can only do so much and still conform to the specs. In particular, XPath attempts to help as much as possible. Therefore, in the trivial cases, you can pass namespace prefixes to Element.elements[...] and so on -- in these cases, XPath will use the namespace environment of the base element you're starting your XPath search from. However, if you want to do something more complex, like pass in your own namespace environment, you have to use the XPath first(), each(), and match() methods. Also, default namespaces force you to use the XPath methods, rather than the convenience methods, because there is no way for XPath to know what the mappings for the default namespaces should be. This is exactly why I loathe namespaces -- a pox on the person(s) who thought them up!
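The two cases described above can be sketched as follows. This assumes the REXML bundled with current Ruby versions; the document, the namespace URIs, and the prefixes d and q are invented for the example.

```ruby
require 'rexml/document'

# Made-up document with a default namespace and a prefixed one.
xml = <<~XML
  <catalog xmlns="urn:example:default" xmlns:p="urn:example:prefixed">
    <item>plain</item>
    <p:item>prefixed</p:item>
  </catalog>
XML
doc  = REXML::Document.new(xml)
root = doc.root

# Trivial case: the prefix is resolved against the namespace
# environment of the element the search starts from.
prefixed = root.elements['p:item']

# Complex case: supply your own namespace environment. The prefixes
# here need not match the ones used in the document itself.
ns = { 'd' => 'urn:example:default', 'q' => 'urn:example:prefixed' }

first_default = REXML::XPath.first(root, 'd:item', ns)
matched       = REXML::XPath.match(root, 'q:item', ns)

texts = []
REXML::XPath.each(doc, '//d:item', ns) { |element| texts << element.text }
```

Binding the default namespace to an explicit prefix (d above) is what lets XPath address elements that carry no prefix in the document.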

Namespaces

Namespace support is now fairly stable. One thing to be aware of is that REXML is not (yet) a validating parser. This means that some invalid namespace declarations are not caught.

Mailing list

There is a low-volume mailing list dedicated to REXML. To subscribe, send an empty email to ser-rexml-subscribe@germane-software.com. This list is more or less spam proof. To unsubscribe, similarly send a message to ser-rexml-unsubscribe@germane-software.com.

RSS

An RSS file for REXML is now being generated from the change log. This allows you to be alerted of upgrades via 'pull' as they become available, if you have an RSS browser. This is an abuse of the RSS mechanism, which was intended to be a distribution system for headlines linked back to full articles, but it works. The headline for REXML is the version number, and the description is the change log. The links all link back to the REXML home page. The URL for the RSS itself is http://www.germane-software.com/software/rexml/rss.xml

Applications that use REXML

Change Log

Known Bugs

Please send me bug reports. If you really want your bug fixed fast, include an runit or Test::Unit method (or methods) that illustrates the problem. You don't have to send me an entire suite; all bug submissions go into test/contrib_test.rb. If you don't send me a unit test, I'll have to write one myself, which will mean that your bug will take longer to fix.

When submitting bug reports, please include the version of Ruby and of REXML that you're using, and the operating system you're running on. Just run: ruby -vrrexml/rexml -e 'p REXML::Version,PLATFORM' and paste the results in your bug report.

To Do

Requested features

FAQ

Hey! REXML is untainting my strings!
Yes, it is, but not intentionally. REXML relies on String.unpack() and Array.pack() to do encoding conversions, and in this process, tainting is lost. If you have a really good reason why REXML should preserve this attribute, at the cost of some speed, let me know and I'll consider your argument.
Why is Element.elements indexed off of '1' instead of '0'?
Because of XPath. The XPath specification states that the index of the first child node is '1'. Although it may be counter-intuitive to base elements on 1, it is more undesirable to have element.elements[0] == element.elements[ 'node()[1]' ]. Since I can't change the XPath specification, the result is that Element.elements[1] is the first child element.
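A small sketch of the 1-based indexing, assuming the REXML bundled with current Ruby versions and a made-up document. The XPath form here uses *[1] rather than node()[1] so that only elements are counted:

```ruby
require 'rexml/document'

doc = REXML::Document.new('<list><first/><second/></list>')
elements = doc.root.elements

by_index = elements[1]       # Integer index, counted from 1
by_xpath = elements['*[1]']  # the same child, selected with XPath
second   = elements[2]
```

Both lookups return the same Element object, so the integer index and the XPath position predicate stay consistent.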
Why isn't REXML a validating parser?
Because validating parsers must include code that parses and interprets DTDs. I hate DTDs. REXML supports the barest minimum of DTD parsing, and even that isn't complete. There is DTD parsing code in the works, but I only work on it when I'm really, really bored. Rumor has it that a contributor is working on a DTD parser for REXML; rest assured that any such contribution will be included with REXML as soon as it is available.
I'm trying to create an ISO-8859-1 document, but when I add text to the document it isn't being properly encoded.
Regardless of what the encoding of your document is, when you add text programmatically to a REXML document you must ensure that you are only adding UTF-8 to the tree. In particular, you can't add ISO-8859-1 encoded text that contains characters above 0x7F to REXML trees -- you must convert it to UTF-8 before doing so. Luckily, this is easy: text.unpack('C*').pack('U*') will do the trick. 7-bit ASCII text is already valid UTF-8, so if your text is plain ASCII you won't need to worry about this.
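The conversion from the last answer, spelled out as a minimal sketch. The sample string is made up, and the comparison with String#encode assumes a Ruby version new enough to have encoding-aware strings (1.9 or later), which is not what this page was originally written against.

```ruby
require 'rexml/document'

# "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1.
latin1 = "caf\xE9".b

# Read each byte as a code point, then write the code points out as UTF-8.
# This works because ISO-8859-1 byte values equal Unicode code points.
utf8 = latin1.unpack('C*').pack('U*')

# On encoding-aware Rubies, String#encode gives the same result.
also_utf8 = latin1.encode('UTF-8', 'ISO-8859-1')

# The converted text can now be added to the tree.
doc = REXML::Document.new('<doc/>')
doc.root.add_text(utf8)
```

The unpack/pack trick is specific to ISO-8859-1; for other source encodings the byte values do not line up with Unicode code points, and a real transcoding step is needed.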

Credits

I've had help from a number of resources; if I haven't listed you here, it means that I just haven't gotten around to adding you, or that I'm a dork and have forgotten. In either case, feel free to write me and complain. I may ignore you, but at least you tried. (Actually, I don't consciously ignore anybody except spammers.)

1) This is no longer true. I'm a convert to the Ruby naming scheme, for Ruby. The reason being that Ruby does a superb job of hiding the difference between attributes and methods; in fact, for all intents and purposes, you can't access attributes directly; all attribute accessors are methods. What this means in the long run is that there is no reason to have different naming conventions for attributes and methods.
2) Be aware, however, that REXML is neither DOM nor SAX compliant, and will never be. The DOM and SAX APIs are unwieldy.
3) For example, element.elements[index] isn't really an array operation; index can be an Integer or an XPath, and this feature is relatively time expensive.
4) As measured with ruby -nle 'print unless /^\s*(#.*|)$/' *.rb | wc -l
5) REXML started out with about 1200, but that number has been steadily increasing as features are added. XPath accounts for 541 lines of that code, so the core REXML has about 1800 LOC.