Chapter 5. XML Processing

5.1. Diving in

This chapter assumes that you are familiar with XML in general, that you can construct a valid XML file by hand, and that you can read a DTD without throwing up. Being a philosophy major is not required, although if you have ever been subjected to the writings of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer science.

There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each tag it finds. (If you read HTML Processing, this should sound familiar, because that's how the sgmllib module works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML file at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM.
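
To make the DOM approach concrete, here is a minimal sketch of parsing a small document and walking the resulting tree. The XML string is made up for illustration (it is not the kant.xml grammar), and the sketch uses only single-argument print calls so that it should run unchanged on both older and newer Python versions.

```python
from xml.dom import minidom

# A tiny, made-up XML document (an assumption for illustration only).
xml_text = "<grammar><ref id='greeting'><p>hello</p><p>hi</p></ref></grammar>"

# parseString reads the entire document at once and returns a Document node.
doc = minidom.parseString(xml_text)
root = doc.documentElement  # the top-level <grammar> element

# Elements expose their tag name, attributes, and child nodes.
ref = root.getElementsByTagName("ref")[0]
ref_id = ref.attributes["id"].value

# Each <p> holds a single Text node; its .data is the text itself.
texts = [p.firstChild.data for p in ref.getElementsByTagName("p")]

print(root.tagName)
print(ref_id)
print(len(texts))
```

Nothing here is streamed: by the time parseString returns, the whole tree is in memory and can be visited in any order, which is the main practical difference from SAX.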

The following is a complete Python program which generates pseudo-random output based on a context-free grammar defined in an XML format.

Example 5.1. kgp.py

If you have not already done so, you can download this and other examples used in this book.

"""Kant Generator for Python

Generates mock philosophy based on a context-free grammar

Usage: python kgp.py [options] [string...]

Options:
  -g ..., --grammar=...   use specified grammar file or URL
  -s ..., --source=...    parse specified source file or URL instead of string
  -w #, --wrap=#          hard wrap output to # characters per line
  -h, --help              show this help
  -d                      show debugging information while parsing

Examples:
  kgp.py                  generates several paragraphs of Kantian philosophy
  kgp.py -w 72 paragraph  generate a paragraph of Kant, wrap to 72 characters
  kgp.py -g husserl.xml   generates several paragraphs of Husserl
  kgp.py -s template.xml  reads from template.xml to decide what to generate
"""
from xml.dom import minidom
import random
import toolbox
import sys
import getopt

_debug = 0

class KantGenerator:
    """generates mock philosophy based on a context-free grammar"""

    def __init__(self, grammar=None, source=None):
        self.refs = {}
        self.defaultSource = None
        self.pieces = []
        self.capitalizeNextWord = 0
        self.loadGrammar(grammar)
        if not source:
            source = self.defaultSource
        self.loadSource(source)
        self.refresh()

    def loadGrammar(self, grammar):
        """load context-free grammar
        
        grammar can be
        - a URL of a remote XML file ("http://diveintopython.org/kant.xml")
        - a filename of a local XML file ("/a/diveintopython/common/py/kant.xml")
        - the actual grammar, as a string
        """
        sock = toolbox.openAnything(grammar)
        self.grammar = minidom.parse(sock).documentElement
        sock.close()
        self.refs = {}
        for ref in self.grammar.getElementsByTagName("ref"):
            self.refs[ref.attributes["id"].value] = ref
        xrefs = {}
        for xref in self.grammar.getElementsByTagName("xref"):
            xrefs[xref.attributes["id"].value] = 1
        xrefs = xrefs.keys()
        standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
        if standaloneXrefs:
            self.defaultSource = '<xref id="%s"/>' % random.choice(standaloneXrefs)
        else:
            self.defaultSource = None
        
    def loadSource(self, source):
        """load source
        
        source can be
        - a URL of a remote XML file ("http://diveintopython.org/section.xml")
        - a filename of a local XML file ("/a/diveintopython/common/py/section.xml")
        - the actual XML to parse, as a string ("<xref id='section'/>")
        """
        sock = toolbox.openAnything(source)
        self.source = minidom.parse(sock).documentElement
        sock.close()

    def reset(self):
        """reset parser"""
        self.pieces = []
        self.capitalizeNextWord = 0

    def refresh(self):
        """reset output buffer and re-parse entire source file
        
        Since parsing involves a good deal of randomness, this is an
        easy way to get new output without having to reload a grammar file
        each time.
        """
        self.reset()
        self.parse(self.source)
        return self.output()

    def output(self):
        """output generated text"""
        return "".join(self.pieces)

    def randomChildElement(self, node):
        """choose a random child element of a node
        
        This is a utility method used by parse_xref and parse_choice.
        """
        def isElement(e):
            return isinstance(e, minidom.Element)
        choices = filter(isElement, node.childNodes)
        chosen = random.choice(choices)
        if _debug:
            print '%s available choices:' % len(choices), [e.toxml() for e in choices]
            print 'Chosen:', chosen.toxml()
        return chosen

    def parse(self, node):
        """parse a single XML node
        
        A parsed XML document (from minidom.parse) is a tree of nodes
        of various types.  Each node is represented by an instance of the
        corresponding Python class (Element for a tag, Text for
        text data, Document for the top-level document).  The following
        statement constructs the name of a class method based on the type
        of node we're parsing ("parse_Element" for an Element node,
        "parse_Text" for a Text node, etc.) and then calls the method.
        """
        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
        parseMethod(node)

    def parse_Document(self, node):
        """parse the document node
        
        The document node by itself isn't interesting (to us), but
        its only child, node.documentElement, is: it's the root node
        of the grammar.
        """
        self.parse(node.documentElement)

    def parse_Text(self, node):
        """parse a text node
        
        The text of a text node is usually added to the output buffer
        verbatim.  The one exception is that <p class='sentence'> sets
        a flag to capitalize the first letter of the next word.  If
        that flag is set, we capitalize the text and reset the flag.
        """
        text = node.data
        if self.capitalizeNextWord:
            self.pieces.append(text[0].upper())
            self.pieces.append(text[1:])
            self.capitalizeNextWord = 0
        else:
            self.pieces.append(text)

    def parse_Element(self, node):
        """parse an element
        
        An XML element corresponds to an actual tag in the source:
        <xref id='...'>, <p chance='...'>, <choice>, etc.
        Each element type is handled in its own method.  Like we did in
        parse(), we construct a method name based on the name of the
        element ("do_xref" for an <xref> tag, etc.) and
        call the method.
        """
        handlerMethod = getattr(self, "do_%s" % node.tagName)
        handlerMethod(node)

    def parse_Comment(self, node):
        """parse a comment
        
        The grammar can contain XML comments, but we ignore them
        """
        pass
    
    def do_xref(self, node):
        """handle <xref id='...'> tag
        
        An <xref id='...'> tag is a cross-reference to a <ref id='...'>
        tag.  <xref id='sentence'/> evaluates to a randomly chosen child of
        <ref id='sentence'>.
        """
        id = node.attributes["id"].value
        self.parse(self.randomChildElement(self.refs[id]))

    def do_p(self, node):
        """handle <p> tag
        
        The <p> tag is the core of the grammar.  It can contain almost
        anything: freeform text, <choice> tags, <xref> tags, even other
        <p> tags.  If a "class='sentence'" attribute is found, a flag
        is set and the next word will be capitalized.  If a "chance='X'"
        attribute is found, there is an X% chance that the tag will be
        evaluated (and therefore a (100-X)% chance that it will be
        completely ignored)
        """
        keys = node.attributes.keys()
        if "class" in keys:
            if node.attributes["class"].value == "sentence":
                self.capitalizeNextWord = 1
        if "chance" in keys:
            chance = int(node.attributes["chance"].value)
            doit = (chance > random.randrange(100))
        else:
            doit = 1
        if doit:
            map(self.parse, node.childNodes)

    def do_choice(self, node):
        """handle <choice> tag
        
        A <choice> tag contains one or more <p> tags.  One <p> tag
        is chosen at random and evaluated; the rest are ignored.
        """
        self.parse(self.randomChildElement(node))

def usage():
    print __doc__

def main(argv):
    grammar = None
    source = None
    wrap = None
    try:
        opts, args = getopt.getopt(argv, "hg:s:w:d", ["help", "grammar=","source=","wrap="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt == '-d':
            global _debug
            _debug = 1
        elif opt in ("-g", "--grammar"):
            grammar = arg
        elif opt in ("-s", "--source"):
            source = arg
        elif opt in ("-w", "--wrap"):
            try:
                wrap = int(arg)
            except ValueError:
                print "Warning: ignoring invalid wrap option: %s" % arg
    
    if not grammar:
        grammar = "kant.xml"
    
    if not source:
        if args:
            source = "".join(["<xref id='%s'/>" % arg for arg in args])

    k = KantGenerator(grammar, source)
    if wrap:
        print toolbox.hardwrap(k.output(), wrap)
    else:
        print k.output()

if __name__ == "__main__":
    main(sys.argv[1:])
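
The parse and parse_Element methods in the listing above choose a handler by building a method name as a string and looking it up with getattr. The following is a minimal sketch of that idea with the XML stripped away. The Dispatcher class and its do_text and do_shout handlers are invented for illustration and are not part of kgp.py.

```python
class Dispatcher:
    """Route each item to a handler chosen by name at run time."""

    def __init__(self):
        self.pieces = []

    def handle(self, kind, value):
        # Build the method name from the kind, look it up, then call it.
        # An unknown kind raises AttributeError, just as an unknown tag
        # would in a class that dispatches on tag names.
        method = getattr(self, "do_%s" % kind)
        method(value)

    def do_text(self, value):
        # Hypothetical handler: append the value verbatim.
        self.pieces.append(value)

    def do_shout(self, value):
        # Hypothetical handler: append the value in upper case.
        self.pieces.append(value.upper())


d = Dispatcher()
d.handle("text", "hello ")
d.handle("shout", "world")
result = "".join(d.pieces)
print(result)
```

The appeal of this design is that supporting a new kind of item means adding one method with the right name; the dispatching code itself never changes.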

Example 5.2. toolbox.py

"""Miscellaneous utility functions"""

def hardwrap(s, maxcol=72):
    """hard wrap string to maxcol columns

    Example:
    >>> print hardwrap("This is a test of the emergency broadcasting system", 25)
    This is a test of the
    emergency broadcasting
    system
    """
    import re
    pattern = re.compile(r'.*\s')
    def wrapline(s, pattern=pattern, maxcol=maxcol):
        lines = []
        start = 0
        while 1:
            if len(s) - start <= maxcol: break
            m = pattern.match(s[start:start + maxcol])
            if not m: break
            newline = m.group()
            lines.append(newline)
            start += len(newline)
        lines.append(s[start:])
        return "\n".join([s.rstrip() for s in lines])
    return "\n".join(map(wrapline, s.split("\n")))

def openAnything(source):
    """URI, filename, or string --> stream

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.
    
    Examples:
    >>> from xml.dom import minidom
    >>> sock = openAnything("http://localhost/kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    """
    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        return urllib.urlopen(source)
    except IOError:
        pass
    
    # try to open with native open function (if source is pathname)
    try:
        return open(source)
    except IOError:
        pass
    
    # assume source is string, create stream
    import StringIO
    return StringIO.StringIO(source)

Run the program kgp.py by itself, and it will print several paragraphs worth of philosophy in the style of Immanuel Kant.

Example 5.3. Sample output of kgp.py


     As is shown in the writings of Hume, our a priori concepts, in
reference to ends, abstract from all content of knowledge; in the study
of space, the discipline of human reason, in accordance with the
principles of philosophy, is the clue to the discovery of the
Transcendental Deduction.  The transcendental aesthetic, in all
theoretical sciences, occupies part of the sphere of human reason
concerning the existence of our ideas in general; still, the
never-ending regress in the series of empirical conditions constitutes
the whole content for the transcendental unity of apperception.  What
we have alone been able to show is that, even as this relates to the
architectonic of human reason, the Ideal may not contradict itself, but
it is still possible that it may be in contradictions with the
employment of the pure employment of our hypothetical judgements, but
natural causes (and I assert that this is the case) prove the validity
of the discipline of pure reason.  As we have already seen, time (and
it is obvious that this is true) proves the validity of time, and the
architectonic of human reason, in the full sense of these terms,
abstracts from all content of knowledge.  I assert, in the case of the
discipline of practical reason, that the Antinomies are just as
necessary as natural causes, since knowledge of the phenomena is a
posteriori.
    The discipline of human reason, as I have elsewhere shown, is by
its very nature contradictory, but our ideas exclude the possibility of
the Antinomies.  We can deduce that, on the contrary, the pure
employment of philosophy, on the contrary, is by its very nature
contradictory, but our sense perceptions are a representation of, in
the case of space, metaphysics.  The thing in itself is a
representation of philosophy.  Applied logic is the clue to the
discovery of natural causes.  However, what we have alone been able to
show is that our ideas, in other words, should only be used as a canon
for the Ideal, because of our necessary ignorance of the conditions.
    Certainly, the thing in itself, indeed, abstracts from all content
of a priori knowledge, by virtue of practical reason.  What we have
alone been able to show is that our ideas have lying before them the
paralogisms of human reason; consequently, our sense perceptions, on
the contrary, can be treated like time.  The Antinomies have lying
before them our synthetic judgements, but necessity, for these reasons,
is the mere result of the power of time, a blind but indispensable
function of the soul.  Because of the relation between the Ideal and
natural causes, we can deduce that the manifold stands in need of our
ideas; for these reasons, the Categories abstract from all content of
knowledge.  (The Antinomies, as I have elsewhere shown, prove the
validity of our ideas.)  The architectonic of human reason can be
treated like the Antinomies.  There can be no doubt that metaphysics,
in particular, would be falsified.
    Our problematic judgements occupy part of the sphere of the
transcendental aesthetic concerning the existence of our faculties in
general; consequently, our a priori judgements would thereby be made to
contradict our ideas.  We can deduce that, so far as I know, the
paralogisms can not take account of the transcendental unity of
apperception.  We can deduce that philosophy excludes the possibility
of, in particular, our synthetic judgements.  However, the things in
themselves are just as necessary as the things in themselves.  As is
evident upon close examination, the phenomena, certainly, exclude the
possibility of our ideas, but the Categories (and it is obvious that
this is the case) can not take account of the Antinomies.  As is shown
in the writings of Hume, the Antinomies, on the other hand, are the
mere results of the power of our experience, a blind but indispensable
function of the soul.  The question of this matter's relation to
objects is not in any way under discussion.
    The thing in itself would thereby be made to contradict applied
logic.  Because of our necessary ignorance of the conditions, there can
be no doubt that, on the contrary, our ideas are just as necessary as,
indeed, the manifold, but our a priori concepts have lying before them
space.  The thing in itself excludes the possibility of, consequently,
the noumena.  There can be no doubt that our experience can thereby
determine in its totality, in the full sense of these terms, space.  By
means of general logic, is it the case that time would thereby be made
to contradict the Ideal of human reason, or is the real question
whether the phenomena abstract from all content of a priori knowledge?
In the study of time, to avoid all misapprehension, it is necessary to
explain that human reason exists in our faculties.  It must not be
supposed that the noumena (and Galileo tells us that this is the case)
are what first give rise to the transcendental aesthetic.
    The never-ending regress in the series of empirical conditions can
never furnish a true and demonstrated science, because, like the
Transcendental Deduction, it teaches us nothing whatsoever regarding
the content of ampliative principles.  The Antinomies, so far as
regards general logic, abstract from all content of knowledge, because
of our necessary ignorance of the conditions.  To avoid all
misapprehension, it is necessary to explain that, insomuch as
metaphysics relies on the phenomena, the paralogisms prove the validity
of the things in themselves.  It remains a mystery why the things in
themselves, consequently, can never, as a whole, furnish a true and
demonstrated science, because, like the transcendental unity of
apperception, they are what first give rise to analytic principles, as
we have already seen.  But can I entertain the Ideal of practical
reason in thought, or does it present itself to me?  Since knowledge of
our a posteriori concepts is a posteriori, what we have alone been able
to show is that the architectonic of natural reason constitutes the
whole content for the things in themselves; for these reasons, our a
priori concepts, for these reasons, prove the validity of natural
causes.
    As will easily be shown in the next section, the employment of the
Categories, in the full sense of these terms, has nothing to do with
the Antinomies.  Time (and what we have alone been able to show is that
this is true) can thereby determine in its totality necessity.  Because
of the relation between the thing in itself and our judgements, let us
suppose that time is just as necessary as, however, the discipline of
practical reason.  In the study of our experience, it remains a mystery
why the manifold (and the reader should be careful to observe that this
is true) stands in need of the transcendental aesthetic, as is evident
upon close examination.  (As is evident upon close examination, we can
deduce that, in so far as this expounds the contradictory rules of
necessity, the Ideal exists in the Antinomies.)  To avoid all
misapprehension, it is necessary to explain that our ideas constitute
the whole content of, in respect of the intelligible character, the
paralogisms; consequently, our ideas, therefore, would be falsified.
By means of analytic unity, the phenomena are by their very nature
contradictory.
    Transcendental logic is a body of demonstrated science, and all of
it must be known a priori.  As we have already seen, the Ideal can
never furnish a true and demonstrated science, because, like the
manifold, it is a representation of a priori principles.  We can deduce
that, in accordance with the principles of necessity, the practical
employment of our concepts, in particular, is just as necessary as the
things in themselves.  It is obvious that, in other words, philosophy
(and it must not be supposed that this is true) excludes the
possibility of the never-ending regress in the series of empirical
conditions, but our analytic judgements (and the reader should be
careful to observe that this is the case) are what first give rise to
the objects in space and time.  To avoid all misapprehension, it is
necessary to explain that practical reason, on the contrary, has
nothing to do with our experience.  The objects in space and time can
never, as a whole, furnish a true and demonstrated science, because,
like the Transcendental Deduction, they stand in need to a priori
principles.  Applied logic, with the sole exception of necessity, can
never furnish a true and demonstrated science, because, like the thing
in itself, it is a representation of inductive principles, but the
architectonic of pure reason depends on, then, the discipline of pure
reason.
    Consequently, Aristotle tells us that philosophy, therefore, is a
body of demonstrated science, and all of it must be known a posteriori.
 However, there can be no doubt that space is the key to understanding
the Categories, as will easily be shown in the next section.  It is
obvious that the employment of the things in themselves, in respect of
the intelligible character, is what first gives rise to the things in
themselves, as will easily be shown in the next section.  Let us
suppose that, even as this relates to the never-ending regress in the
series of empirical conditions, our experience, in accordance with the
principles of the Transcendental Deduction, has nothing to do with
applied logic, yet our sense perceptions constitute the whole content
of the transcendental aesthetic.  And can I entertain metaphysics in
thought, or does it present itself to me?  As is shown in the writings
of Aristotle, the empirical objects in space and time are a
representation of the architectonic of human reason, yet the Ideal is
the clue to the discovery of the objects in space and time.  We can
deduce that necessity is the clue to the discovery of our sense
perceptions; consequently, our sense perceptions, on the other hand,
should only be used as a canon for our sense perceptions.

This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although very verbose -- Kant wasn't what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least the sort of thing that Kant would have agreed with), some of it is blatantly false, and most of it is simply incoherent. But all of it is in the style of Immanuel Kant.

Let me repeat that this is much, much funnier if you are now or have ever been a philosophy major.

The interesting thing about this program is that there is nothing Kant-specific about it. All the content is derived from the context-free grammar file, kant.xml (which we will look at in more detail in the next section). All the kgp.py program does is read through the grammar and make random decisions about which words to plug in where, like a Mad-Libs™ on autopilot. This opens the door to all kinds of randomly generated output; we'll look at examples of different grammars later in the chapter.
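
The idea of "read the grammar and make random decisions" can be sketched in a few lines. This is a simplified illustration, not the kgp.py implementation: the miniature grammar below is invented (the real kant.xml is far larger), the load_refs and expand functions are hypothetical names, and only the ref, xref, and p tags are handled.

```python
import random
from xml.dom import minidom

# Hypothetical miniature grammar, made up for this sketch.
GRAMMAR = """<grammar>
<ref id="sentence"><p><xref id="subject"/> <xref id="verb"/> <xref id="object"/>.</p></ref>
<ref id="subject"><p>Reason</p><p>The Ideal</p></ref>
<ref id="verb"><p>excludes</p><p>proves</p></ref>
<ref id="object"><p>space</p><p>time</p></ref>
</grammar>"""


def load_refs(xml_text):
    """Map each <ref> id to its element, so <xref> tags can be resolved."""
    root = minidom.parseString(xml_text).documentElement
    return dict((ref.getAttribute("id"), ref)
                for ref in root.getElementsByTagName("ref"))


def expand(node, refs, rng):
    """Return the text generated by one node, choosing randomly at each <xref>."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    if node.nodeType != node.ELEMENT_NODE:
        return ""  # ignore comments and other node types
    if node.tagName == "xref":
        target = refs[node.getAttribute("id")]
        # Only element children count as choices; whitespace is skipped.
        choices = [c for c in target.childNodes
                   if c.nodeType == c.ELEMENT_NODE]
        return expand(rng.choice(choices), refs, rng)
    # Any other element (here, <p>): expand its children in order.
    return "".join(expand(child, refs, rng) for child in node.childNodes)


refs = load_refs(GRAMMAR)
start = minidom.parseString('<xref id="sentence"/>').documentElement
sentence = expand(start, refs, random.Random())
print(sentence)
```

Swapping in a different grammar string changes the output completely without touching the code, which is the point made above: nothing in the generator itself knows anything about Kant.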