The Medical University of South Carolina

This is an extremely simple introduction to XML parsing.

XML

First, an example XML document, saved as example.xml for the code in the later sections:

<?xml version="1.0" encoding="utf-8"?>
<list xmlns="http://example.com/schema/list">
 <one foo="bar" />
 <two foo="baz">
   This is some text
 </two>
 <!-- This is a comment -->
 <three>
   There can be some text
   <four>
     followed by a child element
   </four>
   followed by more text
 </three>
</list>

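Note the difference between well-formed and valid XML: a well-formed document follows XML's syntax rules (a single root element, properly nested and closed tags, quoted attribute values), while a valid document is well-formed and also conforms to a DTD or schema. See http://en.wikipedia.org/wiki/XML and the W3C spec. It is also worth learning about XML Schema and XML namespaces; the xmlns attribute on the root element above puts every element in the http://example.com/schema/list namespace.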
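As a rough illustration of the well-formed half of that distinction, the sketch below checks whether a string parses at all. The function name is_well_formed and the two sample strings are made up for this example. It says nothing about validity: the Python standard library does not validate against a DTD or schema, so that step would need a separate validating parser.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        parseString(text)
        return True
    except ExpatError:
        return False


# Hypothetical sample documents, not taken from example.xml.
good = '<list xmlns="http://example.com/schema/list"><one foo="bar" /></list>'
bad = '<list><one foo="bar"></list>'  # <one> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document can pass this check and still be invalid, for instance if it uses an element its schema does not allow.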

DOM

Document Object Model (DOM): the parser reads the whole document into an in-memory tree of node objects. The W3C defines a standard API for navigating and querying that tree.

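    from xml.dom.minidom import parse

    # Parse the whole file into a DOM tree and grab the root <list> element.
    dom = parse("example.xml")
    root = dom.documentElement
    # childNodes also holds text and comment nodes, so filter for elements.
    for child in root.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            print('"%s" is a child element' % child.nodeName)
    # getElementsByTagName searches all descendants, not just direct children.
    twos = root.getElementsByTagName("two")
    if twos:
        two = twos[0]
        print("two's foo is %s" % two.getAttribute("foo"))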
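The example document declares a default namespace, and its elements carry text, so a namespace-aware lookup and a way to read text nodes are natural next steps. The following is a minimal sketch using the standard library's xml.dom.minidom. The helper element_text, the constant LIST_NS, and the inline DOC string (a shortened copy of the example document) are names introduced here for illustration only.

```python
from xml.dom.minidom import parseString

# Namespace URI taken from the xmlns attribute of the example document.
LIST_NS = "http://example.com/schema/list"

# Shortened inline copy of the example document, so the sketch runs on its own.
DOC = """<list xmlns="http://example.com/schema/list">
 <one foo="bar" />
 <two foo="baz">
   This is some text
 </two>
</list>"""


def element_text(element):
    """Join the data of an element's direct text-node children, stripped."""
    parts = [n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE]
    return "".join(parts).strip()


dom = parseString(DOC)
root = dom.documentElement

# Match on namespace URI plus local name instead of the raw tag name.
twos = root.getElementsByTagNameNS(LIST_NS, "two")

print(root.namespaceURI)
print(element_text(twos[0]))
```

Matching on the namespace URI keeps working if a document later binds the same namespace to a prefix, which a plain tag-name lookup would miss.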

SAX

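Simple API for XML (SAX): the parser reads the document sequentially and calls handler methods (callbacks) as it encounters each element, without building a tree in memory. This makes it a good fit for streaming XML and large documents.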

from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class ListSAXHandler(ContentHandler):
    def __init__(self, attrname):
        ContentHandler.__init__(self)
        self.attrname = attrname

    def startElement(self, name, attrs):
        # Called when the parser reads an opening tag.
        print("Just got an element named %s" % name)
        for key, value in attrs.items():
            if key == self.attrname:
                print("Found attr name of %s and its value is %s" % (self.attrname, value))

    def endElement(self, name):
        # Called when the parser reads the matching closing tag.
        print("Just finished an element named %s" % name)


parser = make_parser()
handler = ListSAXHandler('foo')
parser.setContentHandler(handler)
parser.parse("example.xml")
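The handler above reacts only to tags. Text content arrives through a third callback, characters(), and because SAX keeps no tree, the handler has to track which element is open. Here is one possible sketch of that bookkeeping with the standard library's xml.sax. The class name TextCollector and the inline DOC bytes (a fragment of the example document) are assumptions made for this illustration.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Collect the text found directly inside each element, keyed by name."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []  # names of the elements currently open
        self.text = {}   # element name -> list of text chunks

    def startElement(self, name, attrs):
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # The parser may deliver one run of text in several chunks,
        # so append each chunk to the innermost open element.
        if self.stack:
            self.text.setdefault(self.stack[-1], []).append(content)


# Fragment of the example document, inlined so the sketch runs on its own.
DOC = b"""<list xmlns="http://example.com/schema/list">
 <three>
   There can be some text
   <four>
     followed by a child element
   </four>
   followed by more text
 </three>
</list>"""

handler = TextCollector()
parseString(DOC, handler)

# Collapse runs of whitespace so the indentation does not show up in the result.
three = " ".join("".join(handler.text["three"]).split())
four = " ".join("".join(handler.text["four"]).split())
print(three)
print(four)
```

Keying the dictionary by element name is enough for this small document; with repeated element names the chunks from every occurrence would be merged, so a real handler would likely store results per element instance.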

DOMLight

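DOMLight: an example of a simple DOM access class in Python. As the example shows, child elements are reached as attributes of their parent, and XML attributes are read with dictionary-style keys. See DOMLight.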

    # Assumes the DOMLight module is importable under this name.
    import DOMLight

    # Read the whole document into a string; the file is closed automatically.
    with open('example.xml', 'r') as f:
        xml = f.read()

    root = DOMLight.createModel(xml)
    print("namespace is %s" % root['xmlns'])
    # get root's first child named 'one'
    one = root.one[0]
    # access an attribute
    print(one['foo'])
    # get root's first child named 'two' and print the first child that is a text node
    print("two's text: %s" % root.two[0].text[0])

CarcWiki: Documentation/XMLParsing (last edited 2007-03-27 07:48:56 by BrianMuller)