This is an extremely simple introduction to XML parsing.
XML
First, an example XML document:
<?xml version="1.0" encoding="utf-8"?>
<list xmlns="http://example.com/schema/list">
  <one foo="bar" />
  <two foo="baz">
    This is some text
  </two>
  <!-- This is a comment -->
  <three>
    There can be some text
    <four>
      followed by a child element
    </four>
    followed by more text
  </three>
</list>
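Note that well-formed XML and valid XML are different things. A well-formed document obeys the XML syntax rules (a single root element, properly nested and closed tags, quoted attribute values). A valid document is well-formed and also conforms to a DTD or schema. See http://en.wikipedia.org/wiki/XML and the W3C spec. Also learn about XML Schema and XML namespaces.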
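As a rough illustration of the well-formed half of that distinction, the sketch below asks the standard library's minidom parser whether a string parses at all. The helper name is_well_formed and the two sample strings are made up for this example, and the code is written for Python 3. It says nothing about validity: checking a document against a DTD or schema needs a validating parser, which the Python standard library does not include.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise (hypothetical helper)."""
    try:
        xml.dom.minidom.parseString(text)
    except ExpatError:
        return False
    return True


good = '<list xmlns="http://example.com/schema/list"><one foo="bar" /></list>'
bad = '<list><one foo="bar"></list>'  # <one> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The first call should report True and the second False, because the unclosed <one> tag breaks the nesting rules.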
DOM
Document Object Model: the parser reads the whole document into memory and converts it into a tree of node objects (a document object). Document objects expose a standard API defined by the W3C.
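from xml.dom.minidom import parse

# Parse the whole file into an in-memory tree
dom = parse("example.xml")
root = dom.documentElement

# childNodes also contains text and comment nodes, so filter for elements
for child in root.childNodes:
    if child.nodeType == child.ELEMENT_NODE:
        print('"%s" is a child element' % child.nodeName)

# Look up descendants by tag name and read an attribute
twos = root.getElementsByTagName("two")
if len(twos) > 0:
    two = twos[0]
    print("two's foo is %s" % two.getAttribute("foo"))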
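Two things the DOM example does not show are reading an element's text and looking elements up by namespace. The following is a minimal sketch of both with xml.dom.minidom, assuming Python 3 and a trimmed-down inline copy of the example document. The helper text_of and the constant LIST_NS are names invented here, not part of the DOM API.

```python
from xml.dom.minidom import parseString

# Namespace URI taken from the example document's xmlns attribute
LIST_NS = "http://example.com/schema/list"

doc = parseString(
    '<list xmlns="http://example.com/schema/list">'
    '<two foo="baz">This is some text</two>'
    '</list>'
)


def text_of(element):
    """Concatenate the direct text-node children of element (hypothetical helper)."""
    return "".join(
        node.data for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


# Namespace-aware lookup: match on (namespace URI, local name)
two = doc.getElementsByTagNameNS(LIST_NS, "two")[0]
print(two.namespaceURI)
print(text_of(two))
```

In the DOM, text is not a property of the element itself. It lives in child text nodes, which is why the helper walks childNodes.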
SAX
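Simple API for XML: an event-driven parser that calls handler methods (callbacks) as it reads each start tag, end tag and run of text. Because it never builds the whole tree in memory, it is used for streaming XML and large documents.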
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

class ListSAXHandler(ContentHandler):
    def __init__(self, attrname):
        ContentHandler.__init__(self)
        self.attrname = attrname

    # Called for every opening tag; attrs holds the element's attributes
    def startElement(self, name, attrs):
        print("Just got an element named %s" % name)
        for key, value in attrs.items():
            if key == self.attrname:
                print("Found attr name of %s and its value is %s" % (self.attrname, value))

    # Called for every closing tag
    def endElement(self, name):
        print("Just finished an element named %s" % name)

parser = make_parser()
handler = ListSAXHandler('foo')
parser.setContentHandler(handler)
parser.parse("example.xml")
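The handler above only reacts to tags. Text arrives through a third callback, characters(), and since SAX keeps no tree, the handler has to track where it is in the document by itself. Here is a sketch of one way to do that, assuming Python 3. The class name TextCollector and the inline sample document are made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Collect all text found inside elements with a given name (hypothetical class)."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # greater than 0 while inside a target element
        self.chunks = []

    def startElement(self, name, attrs):
        # Enter the target, or go one level deeper if already inside it
        if name == self.target or self.depth:
            self.depth += 1

    def endElement(self, name):
        if self.depth:
            self.depth -= 1

    def characters(self, content):
        # The parser may deliver one run of text in several pieces
        if self.depth:
            self.chunks.append(content)


handler = TextCollector("two")
sample = b'<list><one foo="bar" /><two foo="baz">This is some text</two></list>'
xml.sax.parseString(sample, handler)
print("".join(handler.chunks))
```

Joining the chunks at the end, rather than assuming one characters() call per text node, matters because a SAX parser is allowed to split text wherever it likes.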
DOMLight
DOMLight: an example of a simple DOM access class in Python. See DOMLight.
# Read the whole document into a string
with open('example.xml', 'r') as f:
    xml = f.read()

root = DOMLight.createModel(xml)
print("namespace is %s" % root['xmlns'])

# get root's first child named 'one'
one = root.one[0]
# access an attribute
print(one['foo'])

# get root's first child named 'two' and print the first child that is a text node
print("two's text: %s" % root.two[0].text[0])


