This is the support website for the book:

Das Python Praxisbuch
Der große Profi-Leitfaden für Programmierer
Farid Hajji
Addison Wesley / Pearson Education
ISBN 978-3-8273-2543-3 (Sep 2008), 1298 pages.

12. XML and XSLT

An XML file

<?xml version="1.0" encoding="utf-8"?>
<languages>

  The interpreted languages are Python, Ruby, Perl and PHP.

  <language name="python">
    <!-- Our favorite language -->
    <name>Python</name>
    <inventor>Guido van Rossum</inventor>
    <url>www.python.org</url>
  </language>
  <language name="ruby">
    <name>Ruby</name>
    <inventor>Yukihiro Matsumoto</inventor>
    <url>www.ruby-lang.org</url>
  </language>
  <language name="perl">
    <name>Perl</name>
    <inventor>Larry Wall</inventor>
    <url>www.perl.org</url>
  </language>
  <language name="php">
    <name>PHP</name>
    <inventor>Rasmus Lerdorf</inventor>
    <url>www.php.net</url>
  </language>

  Compiled languages are C and C++

  <language name="c">
    <name>C</name>
    <inventor>Dennis Ritchie</inventor>
    <inventor>Brian Kernighan</inventor>
  </language>
  <language name="c++">
    <name>C++</name>
    <inventor>Bjarne Stroustrup</inventor>
  </language>

  Lisp is normally interpreted, but can be compiled as well

  <language name="lisp">
    <name>Lisp</name>
    <inventor>John McCarthy</inventor>
  </language>
</languages>

languages.xml

xml.etree.ElementTree
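
As a minimal sketch of how the standard library module xml.etree.ElementTree could read a file shaped like languages.xml: the example below embeds a shortened copy of the data so that it is self-contained, and the helper name list_languages is hypothetical, not taken from the book.

```python
import xml.etree.ElementTree as ET

# Shortened copy of languages.xml, embedded so the sketch is self-contained.
LANGUAGES_XML = """<languages>
  The interpreted languages are Python, Ruby, Perl and PHP.
  <language name="python">
    <!-- Our favorite language -->
    <name>Python</name>
    <inventor>Guido van Rossum</inventor>
    <url>www.python.org</url>
  </language>
  <language name="c">
    <name>C</name>
    <inventor>Dennis Ritchie</inventor>
    <inventor>Brian Kernighan</inventor>
  </language>
</languages>"""


def list_languages(xml_text):
    """Return (name, [inventors]) pairs for every <language> element."""
    root = ET.fromstring(xml_text)
    result = []
    for lang in root.findall("language"):
        name = lang.findtext("name")
        inventors = [inv.text for inv in lang.findall("inventor")]
        result.append((name, inventors))
    return result


languages = list_languages(LANGUAGES_XML)
for name, inventors in languages:
    print(name + ": " + ", ".join(inventors))
```

To read the real file instead of the embedded string, ET.parse(path).getroot() could replace ET.fromstring(xml_text).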

4Suite-XML

Installing 4Suite-XML

The 4Suite-XML scripts

<a><b></b><b></b></a>

test.xml

<a><b></b><b></a></a>

test2.xml
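
The two files above differ in well-formedness: in test2.xml the second b element is never closed. The following sketch shows one way to observe that difference without 4Suite, using the standard library parser instead of the 4Suite scripts. The function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


TEST_XML = "<a><b></b><b></b></a>"    # content of test.xml
TEST2_XML = "<a><b></b><b></a></a>"   # content of test2.xml

print(is_well_formed(TEST_XML))
print(is_well_formed(TEST2_XML))
```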

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b/></a>

test3.xml

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b/><b/></a>

test4.xml

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b>Non empty b</b></a>

test5.xml
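
All three files test3.xml, test4.xml and test5.xml are well-formed; they differ only in whether they satisfy the embedded DTD. The Python standard library has no DTD validator, so the sketch below is not a validator: it is a hand-written check, under the assumption that only this one DTD matters, that spells out what the two declarations require. The name matches_dtd is hypothetical.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset shared by test3.xml, test4.xml and test5.xml.
DTD = "<!DOCTYPE a [\n  <!ELEMENT a (b, b)>\n  <!ELEMENT b EMPTY>\n]>\n"


def matches_dtd(xml_text):
    """Hand-written check mirroring the DTD above (not a general validator)."""
    root = ET.fromstring(xml_text)
    children = list(root)
    # <!ELEMENT a (b, b)>: exactly two b children ...
    if root.tag != "a" or [c.tag for c in children] != ["b", "b"]:
        return False
    # ... and no character data between them (whitespace is tolerated).
    stray_text = [root.text] + [c.tail for c in children]
    if any(t and t.strip() for t in stray_text):
        return False
    # <!ELEMENT b EMPTY>: no text and no child elements.
    return all(c.text is None and len(c) == 0 for c in children)


test3 = DTD + "<a><b/><b/></a>"
test4 = DTD + "<a><b/><b/><b/></a>"
test5 = DTD + "<a><b/><b>Non empty b</b></a>"

for doc in (test3, test4, test5):
    print(matches_dtd(doc))
```

A real validating parser, such as the validating reader of 4Suite used later in this chapter, derives these checks from the DTD itself.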

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="xml" indent="yes"
          omit-xml-declaration="no"
          doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
          doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>

  <xsl:template match="/">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
      <head><title>Programming Languages</title></head>
      <body>
        <h1>Programming Languages</h1>
        <ul>
          <xsl:apply-templates select="/languages/language"/>
        </ul>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="language">
    <li xmlns="http://www.w3.org/1999/xhtml">
      <b><xsl:copy-of select="name/text()"/></b>
      <xsl:text>: </xsl:text>
      <xsl:apply-templates select="inventor"/>
    </li>
  </xsl:template>

  <xsl:template match="inventor">
    <xsl:copy-of select="text()"/>
    <xsl:text> </xsl:text>
  </xsl:template>

</xsl:stylesheet>

languages.xsl

After running 4xslt languages.xml languages.xsl > languages.html, the file languages.html looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Programming Languages</title>
  </head>
  <body>
    <h1>Programming Languages</h1>
    <ul>
      <li>
        <b>Python</b>: Guido van Rossum </li>
      <li>
        <b>Ruby</b>: Yukihiro Matsumoto </li>
      <li>
        <b>Perl</b>: Larry Wall </li>
      <li>
        <b>PHP</b>: Rasmus Lerdorf </li>
      <li>
        <b>C</b>: Dennis Ritchie Brian Kernighan </li>
      <li>
        <b>C++</b>: Bjarne Stroustrup </li>
      <li>
        <b>Lisp</b>: John McCarthy </li>
    </ul>
  </body>
</html>

Ft.Xml.InputSource input sources

How to use Ft.Xml.InputSource:

from Ft.Xml import InputSource

factory = InputSource.DefaultFactory

isrc1 = factory.fromString("<a><b/><b/></a>",
                           "https://pythonbook.hajji.org/examples/xml")

isrc2 = factory.fromStream(open("/var/tmp/languages.xml", "rb"),
                           "https://pythonbook.hajji.org/examples/xml")

isrc3 = factory.fromUri(
             "https://pythonbook.hajji.org/examples/xml/languages.xml")

DOM

isrc4 looks like this:

>>> isrc4 = factory.fromString('''<?xml version="1.0" encoding="utf-8"?>
... <!DOCTYPE a [
...   <!ELEMENT a (b, b)>
...   <!ELEMENT b EMPTY>
... ]>
... <a><b/><b/></a>''', "https://pythonbook.hajji.org/examples/xml")

And isrc5 looks like this:

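from Ft.Xml import ReaderException

# factory is InputSource.DefaultFactory, as above. vreader is assumed to be
# a validating reader created earlier, e.g. ValidatingReaderBase() from
# Ft.Xml.Domlette.
isrc5 = factory.fromString('''<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b>Not empty</b></a>''',
"https://pythonbook.hajji.org/examples/xml")

# The second <b> is not empty, so validation against the DTD fails.
try:
    doc5 = vreader.parse(isrc5)
    print "doc5 successfully parsed"
except ReaderException, e:
    print e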

Understanding DOM

To obtain the output as a (byte) string, you can use StringIO:

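from cStringIO import StringIO
from Ft.Xml.Domlette import PrettyPrint

# doc1 is assumed to be a document parsed earlier.
sio = StringIO()
PrettyPrint(doc1, stream=sio, encoding="utf-8")
buf = sio.getvalue()   # the serialized document as a byte string
sio.close()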
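
The same idea, serializing into an in-memory stream and then reading its contents, can be sketched with the standard library alone. This is not the 4Suite API: it swaps in xml.etree.ElementTree and io.BytesIO, and the small document is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# A small stand-in document (assumption; any parsed tree would do).
root = ET.fromstring("<a><b/><b/></a>")
tree = ET.ElementTree(root)

# Write the serialized tree into an in-memory byte stream.
bio = io.BytesIO()
tree.write(bio, encoding="utf-8", xml_declaration=True)
buf = bio.getvalue()   # bytes, including the XML declaration
bio.close()

print(type(buf).__name__)
```

When no stream is needed, ET.tostring(root, encoding="utf-8") returns the bytes directly.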

Extracting elements with XPath

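Elements can be extracted from an XML file by navigating the tree directly, but this is suboptimal because the index depends on the exact position of the node, including any text nodes before it: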

from Ft.Xml import InputSource
from Ft.Xml.Domlette import NonvalidatingReaderBase

factory = InputSource.DefaultFactory
reader = NonvalidatingReaderBase()

isrc2 = factory.fromStream(open("/var/tmp/languages.xml", "rb"),
                           "https://pythonbook.hajji.org/examples/xml")

doc2 = reader.parse(isrc2)
root = doc2.documentElement

# childNodes[0] is the leading text node; the first <language> follows it.
python = root.childNodes[1]

XPath makes this easier:

>>> root.xpath(u'//language[@name="python"]')
[<Element at 0x286a18ac: name u'language', 1 attributes, 9 children>]

>>> python = root.xpath(u'//language[@name="python"]')[0]

>>> python
<Element at 0x286a18ac: name u'language', 1 attributes, 9 children>
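
The contrast between positional access and path-based selection can also be sketched with the standard library. This is a stand-in for the 4Suite calls above, not the same API: xml.dom.minidom provides childNodes, and xml.etree.ElementTree supports only a limited subset of XPath. The shortened document is an assumption made for illustration.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

XML = """<languages>
  The interpreted languages are Python, Ruby, Perl and PHP.
  <language name="python"><name>Python</name></language>
  <language name="ruby"><name>Ruby</name></language>
</languages>"""

# Positional access: childNodes[0] is the leading text node, so the
# first <language> element only shows up at index 1.
dom_root = xml.dom.minidom.parseString(XML).documentElement
first = dom_root.childNodes[0]
second = dom_root.childNodes[1]

# Path-based access: select by attribute value, independent of position.
et_root = ET.fromstring(XML)
matches = et_root.findall(".//language[@name='python']")
python_name = matches[0].findtext("name")

print(second.getAttribute("name"), python_name)
```

If text or comments are added before the first language element, the positional index shifts, while the path expression keeps selecting the same element.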

SAX

The input source:

from Ft.Xml import InputSource

factory = InputSource.DefaultFactory
isrc = factory.fromUri("file:///var/tmp/languages.xml")

The SAX parser (Saxlette):

from Ft.Xml import Sax

parser = Sax.CreateParser()

The content handler:

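class TagCounter(object):
    "Content handler that counts how often each element name occurs."

    def startDocument(self):
        self.tagCount = {}

    def startElementNS(self, name, qname, attribs):
        # name is the (namespace URI, local name) pair of the element
        if name in self.tagCount:
            self.tagCount[name] += 1
        else:
            self.tagCount[name] = 1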

tagcounter.py

And now we parse:

from tagcounter import TagCounter

handler = TagCounter()
parser.setContentHandler(handler)
parser.parse(isrc)
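
For readers without 4Suite, the same event-driven counting can be sketched with the standard library module xml.sax. This is a substitute for Saxlette, not its API: namespace processing is off by default there, so the handler receives startElement with a plain tag name instead of startElementNS. The embedded document is a shortened, assumed copy of languages.xml.

```python
import xml.sax
from collections import Counter


class TagCounter(xml.sax.ContentHandler):
    """Count how often each element name occurs in a document."""

    def startDocument(self):
        # Reset the counts at the start of every parse.
        self.tag_count = Counter()

    def startElement(self, name, attrs):
        self.tag_count[name] += 1


XML = b"""<languages>
  <language name="python">
    <name>Python</name>
    <inventor>Guido van Rossum</inventor>
    <url>www.python.org</url>
  </language>
  <language name="c">
    <name>C</name>
    <inventor>Dennis Ritchie</inventor>
    <inventor>Brian Kernighan</inventor>
  </language>
</languages>"""

handler = TagCounter()
xml.sax.parseString(XML, handler)

for tag, count in sorted(handler.tag_count.items()):
    print(tag, count)
```

The handler never holds the whole document in memory; it only sees one event at a time, which is the main design difference from the DOM approach.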

Transformations with XSLT

To transform languages.xml with languages.xsl:

from Ft.Xml import InputSource
factory = InputSource.DefaultFactory

ixml = factory.fromUri('file:///var/tmp/languages.xml')
ixsl = factory.fromUri('file:///var/tmp/languages.xsl')

from Ft.Xml.Xslt import Processor
processor = Processor.Processor()

processor.appendStylesheet(ixsl)
result = processor.run(ixml)

result contains:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Programming Languages</title>
  </head>
  <body>
    <h1>Programming Languages</h1>
    <ul>
      <li>
        <b>Python</b>: Guido van Rossum </li>
      <li>
        <b>Ruby</b>: Yukihiro Matsumoto </li>
      <li>
        <b>Perl</b>: Larry Wall </li>
      <li>
        <b>PHP</b>: Rasmus Lerdorf </li>
      <li>
        <b>C</b>: Dennis Ritchie Brian Kernighan </li>
      <li>
        <b>C++</b>: Bjarne Stroustrup </li>
      <li>
        <b>Lisp</b>: John McCarthy </li>
    </ul>
  </body>
</html>
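
To make clear what the three templates in languages.xsl do, the sketch below spells out the same mapping in plain Python. It is not an XSLT processor and does not use 4Suite: it is a hand-written equivalent under simplifying assumptions (no DOCTYPE, no XHTML namespace, no head element), and the function name languages_to_html is hypothetical.

```python
import xml.etree.ElementTree as ET

LANGUAGES_XML = """<languages>
  <language name="python">
    <name>Python</name>
    <inventor>Guido van Rossum</inventor>
  </language>
  <language name="c">
    <name>C</name>
    <inventor>Dennis Ritchie</inventor>
    <inventor>Brian Kernighan</inventor>
  </language>
</languages>"""


def languages_to_html(xml_text):
    """Build one <li> per <language>, as the stylesheet's templates do."""
    root = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = "Programming Languages"
    ul = ET.SubElement(body, "ul")
    # Corresponds to apply-templates select="/languages/language".
    for lang in root.findall("language"):
        li = ET.SubElement(ul, "li")
        b = ET.SubElement(li, "b")
        b.text = lang.findtext("name")
        # Text after </b>: ": ", then each inventor followed by a space,
        # like the "inventor" template.
        inventors = "".join(inv.text + " " for inv in lang.findall("inventor"))
        b.tail = ": " + inventors
    return ET.tostring(html, encoding="unicode")


result = languages_to_html(LANGUAGES_XML)
print(result)
```

The stylesheet expresses the same mapping declaratively, which is why adding a language to languages.xml requires no change to languages.xsl.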

Summary