Open Source XML Applications
21 September 2009
XML Tools and Applications with FLOSS
Abstract
From the very beginnings of XML, open source parsers and processors
have been freely available to the XML development community. Over the
last ten years, building on these basic foundations, a huge library of
free/libre open source software (FLOSS) has grown up to support and
capitalise on XML open standards. This class explores some of those
XML tools and applications, covering both the essential and the
esoteric, and shows how they can make a real impact for you and
your organisation.
Overview
- Introduction
- Input
- Processing
- Storage
- Output
Organization of the presentation
Free/libre open-source software covers almost all of IT.
So does XML.
This is thus a 90-minute course in what kinds of software exist.
Needless to say, it is not complete.
From time to time, we will pause for a demo.
Who's here? (1)
So it's clearer where to linger, who ...
- writes programs for a living?
- writes programs from time to time?
- has learned a programming language?
- has other people to deal with that kind of thing?
Who's here? (2)
How many of you work / have an interest in ...
- SQL databases and their ilk?
- Word? other office-automation software?
- publishing? back-of-the-book indexing?
- Web site design?
Road map: processes
Things we do:
- input: create XML
- process: update, transform, change, mangle, munge, mince,
macerate, puree, aggregate, mix, combine, transmit,
interchange, ... our XML
- store: save, query, extract, retrieve, manage XML
- output: display, format, print, deliver XML
Some software specializes, some generalizes.
Road map: agents
Who does it / whom is it for?
This sometimes matters a lot, sometimes very little.
Road map: orientation
How general is it?
- vertical: specific industries or application areas
(e.g. ‘pig-farming markup language’)
- horizontal: no specific area, universal importance
(typically infrastructure)
- diagonal: not exactly horizontal, not exactly vertical. E.g.
- office documents
- slide shows
- images, graphics
- Web application tools
- software development tools
What counts as open source?
You tell me.
- included if clearly open source
- excluded if clearly not
- In the large gray area I have
chosen to be arbitrary,
inconsistent, and capricious. When the details
matter to you, check the license.
Input
Document creation and validation.
- editors
- parsers, validators
- conversion tools
- XForms
- office documents
Editors 0: overview
- programmers' editors extended for XML
- HTML editors extended for XML
- GUI XML editors
- IDEs
What most people choose ...
Editors 1: text-based XML editors
Mostly these are programmers' editors extended to handle XML.
- jEdit,
mature programmers' text editor (popular as Eclipse plugin), in Java
- nxml.el
(James Clark) a major mode for Emacs (GNU Emacs only)
- psgml.el
(Lennart Staflin et al.) an Emacs major mode for SGML and XML
- Rinzo XML editor
Eclipse plug-in, close integration with Java
- vim does XML syntax coloring automatically nowadays (scripts on
www.vim.org can help, too)
- XED
(Henry S. Thompson) XML instance editor; works very hard to make
it impossible to make ill-formed documents
Editors 2: HTML/XML editors
Mostly these are HTML editors extended to handle XML.
- Amaya (INRIA, W3C) —
predominantly a “Web editor”, but extended to support XML. Generic
XML editing is “still experimental” (29 Feb. 2008)
- Quanta Plus
- Screem (web development environment with
XML-capable editor)
Editors 3: GUI XML editors
N.B. “GUI” ≠ “WYSIWYG”!
- BXE
(Liip AG):
browser-based WYSIWYG XML editor (currently Mozilla only)
- Jaxe,
configurable WYSIWYM editor in Java; also available in applet form as
WebJaxe
- MIView (Gnome)
source and tree views, validation, ...
- Pollo
XML editor in Java; heavy emphasis on tree structure (best for
data-oriented XML?), proud of its tree widget
- Serna Free XML Editor
(Syntext) open version of commercial product
- Vex (John Krasnay et al.)
visual editor for XML, in Java
- Xerlin (Exari?) Opensource Extensible
XML Modeling Application (based on Merlot)
- XML Copy Editor
(G. N. Schmidt), syntax coloring etc., but mostly text-oriented
N.B. boundary to following list is hazy.
Editors 4: XML IDEs
Interactive development environments, mostly aimed at
programmers. Boundary to preceding list hazy.
- Butterfly XML
error highlighting, incremental parsing, auto-completion, XSLT pipelines, etc.
- Cooktop (Victor Pavlov)
editor and development environment; Windows only. Freeware (as in beer).
- OrangevoltXSLT
(XSLT development environment, Eclipse plug-in)
- XCarecrows 4 XML (Cogenit)
Eclipse plug-in (XML, XSD, XSLT editor, graphic tree comparator,
schema validator, XSLT transformation tool kit)
- XPontus
(Yves Zoundi), text-oriented, with validation, transforms, DTD generation, etc.
Demo: jEdit
Pause here for brief demo of jEdit.
Note:
- syntax coloring
- tree view
- context-sensitive element insertion
- limits of syntax awareness
- treatment of entities
Demo: Serna Free XML editor
Pause here for brief demo of Serna.
Note:
- WYSIWYG display (XSL FO)
- consequent setup overhead
- styling
- tree view
- context-sensitive element insertion
- treatment of entities
Parsers
For software, essentially several classes of parsers / several
interfaces:
- SAX Simple API for XML
- StAX Streaming API for XML
- DOM Document Object Model
- JAXP (Java API for XML Processing) provides both SAX and DOM
- parsers with other interfaces
Many parsers supplied in program libraries,
essentially invisible.
For humans, not much difference.
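For programmers the difference is visible in the code. A minimal sketch in Python, using only the standard library (whose SAX and DOM modules sit on top of expat); the sample document and the function names are invented for illustration:

```python
import xml.sax
from xml.dom import minidom

# Invented sample document.
DOC = b"<tools><tool name='expat'/><tool name='libxml2'/></tools>"


class ToolNameHandler(xml.sax.ContentHandler):
    """SAX style: the parser pushes events; we keep only what we need."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        if name == "tool":
            self.names.append(attrs.getValue("name"))


def names_via_sax(data):
    handler = ToolNameHandler()
    xml.sax.parseString(data, handler)
    return handler.names


def names_via_dom(data):
    # DOM style: the whole document is built in memory, then navigated.
    document = minidom.parseString(data)
    return [el.getAttribute("name") for el in document.getElementsByTagName("tool")]


print(names_via_sax(DOC))
print(names_via_dom(DOC))
```

Both functions return the same list; the SAX version never holds the whole document, the DOM version does.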
Major current parsers
- DOM4J JAXP (thus SAX and DOM) compliant
parser in Java
- expat
(James Clark); in C; SAX and DOM interfaces often built on top; non-validating
- libxml2 (Gnome / Daniel Veillard)
SAX* + DOM (gdome); in C; validating
- rxp (Richard
Tobin); in C
- Woodstox
(Codehaus) StAX-compliant validating parser in Java
- Xerces C++ (Apache)
validating SAX, SAX2, DOM parser
- Xerces J (Apache)
validating SAX2, DOM parser
- XML::LibXML
(Petr Pajas) Perl wrapper for libxml2
- XML::SAX::Expat
(Björn Höhrmann) Perl wrapper for expat
- XP (James Clark)
non-validating; in Java; intended for delivery not authoring,
so “error handling is brutal”
Other parsers
Some of these are listed mostly for historical interest.
- Aelfred
(Saxon version)
- Aelfred 2 (GNU XML package)
- Crimson (Sun)
XML parser in JDK up to 1.4; SAX, DOM
- GNU JAXP
- Lark (Tim Bray)
one of the first XML parsers; Larval (validating)
built on same code base
- VTD-XML
- XML::SAX::PurePerl
fallback when others unavailable
- XML::Twig to process
large documents in limited space
- XML::Xerces (Perl
interface to Xerces C++)
- XParse-J
“aspires to be the smallest Java XML parser on the planet.”
- XOM (Elliotte Rusty Harold)
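The idea of processing large documents in limited space can be sketched as follows. This is a Python standard-library analogue, not XML::Twig itself; the function name and the sample data are invented for illustration:

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag):
    """Count elements named `tag` without keeping their content in memory."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop this element's children and text once handled
    return count


# Invented sample: a log with many small entries.
sample = io.BytesIO(b"<log>" + b"<entry>x</entry>" * 1000 + b"</log>")
print(count_records(sample, "entry"))
```

Note that the emptied elements remain attached to the root, so memory use still grows slowly with document size; a production version would also detach them.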
Parsing infrastructure
In particular, XML catalogs and URI resolvers.
Both SAX and JAXP provide hooks for user-specified
entity resolvers.
Some but not all parsers support catalogs. It matters.
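A minimal sketch of the SAX hook, in Python. The catalog here is just a dictionary standing in for a real XML catalog file; the public identifier and the local path are hypothetical:

```python
import xml.sax
from xml.sax.handler import EntityResolver

# Hypothetical catalog: public identifier -> local copy of the DTD.
CATALOG = {
    "-//EXAMPLE//DTD Memo 1.0//EN": "/usr/local/share/dtd/memo.dtd",
}


class CatalogResolver(EntityResolver):
    """Redirect known public identifiers to local files."""

    def __init__(self, catalog):
        self.catalog = catalog

    def resolveEntity(self, publicId, systemId):
        # Fall back to the original system identifier when there is no entry.
        return self.catalog.get(publicId, systemId)


parser = xml.sax.make_parser()
parser.setEntityResolver(CatalogResolver(CATALOG))
```

Whether the parser ever calls the resolver depends on its configuration; recent Python versions, for instance, do not process external general entities unless that feature is switched on.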
Validation
- DTDs
- XSD
- Relax NG
- Schematron
- other validation technologies
DTD validators
All parsers described as “validating”
are DTD validators.
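Conversely, a non-validating parser checks only well-formedness. A small Python sketch of the difference, using the standard library's expat-based parser; the sample documents and the function name are invented for illustration:

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares <memo> as EMPTY, so this document is
# invalid, but it is still well-formed.
INVALID_BUT_WELL_FORMED = "<!DOCTYPE memo [<!ELEMENT memo EMPTY>]><memo><to/></memo>"

# Mismatched tags: not well-formed at all.
ILL_FORMED = "<memo><to></memo>"


def is_well_formed(text):
    """Return True if a non-validating parse succeeds."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(INVALID_BUT_WELL_FORMED))
print(is_well_formed(ILL_FORMED))
```

The first document is accepted despite violating its own DTD; catching that requires one of the validating parsers.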
XSD validators
- libxml2 / xmllint (Gnome)
partial implementation
- MSV Multi-schema validator (Sun)
- Xerces-C (Apache)
- Xerces-J (Apache)
Apache version, JDK version*
- XSV
(Univ. Edinburgh and W3C)
Many data binding tools also produce validating code. See also data binding tools.
Relax NG validators
- Jing (James
Clark) also validates XSD,
Schematron 1.5, NVDL
- libxml2 / xmllint (Gnome)
partial implementation
- MSV Multi-schema validator (Sun)
- RNV (David Tolpin)
implementation of Relax NG compact syntax
Schematron implementations
XForms 1: in the browser
- Client-side plugins:
- Mozilla XForms Project
plug-in for Mozilla, Firefox, SeaMonkey, etc.
- MozzIE
plugin for MS Internet Explorer, displays XForms using Gecko (sic)
- Client-side implementations:
XForms 2: server-side
The server does most or all of the work:
XForms 3: stand-alone
Stand-alone and embedded:
- OpenOffice
- X-Smiles
(Helsinki Univ. of Technology)
XML browser in Java, runs both stand-alone and embedded;
can also run in browser as applet
Processing (1)
- XML and programming languages (‘data binding’)
- Web services
- XML messages / PKI
- programmer tools
- user-interface specification
- processing tools specifically for XML
Data binding tools
See also object-relational mapping.
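What data binding amounts to, as a hand-written sketch in Python: an object is mapped to XML and back. Real data binding tools typically generate this kind of code from a schema; the Book class and the element names below are invented for illustration:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Book:
    """Hypothetical application object."""
    title: str
    year: int


def to_xml(book):
    # Object -> XML: year becomes an attribute, title a child element.
    elem = ET.Element("book", {"year": str(book.year)})
    ET.SubElement(elem, "title").text = book.title
    return ET.tostring(elem, encoding="unicode")


def from_xml(text):
    # XML -> object: the reverse mapping, including the type conversion.
    elem = ET.fromstring(text)
    return Book(title=elem.findtext("title"), year=int(elem.get("year")))


original = Book(title="An Example Title", year=2009)
restored = from_xml(to_xml(original))
print(restored == original)
```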
XML messages
Transmitting XML messages may involve:
- encryption
- digital signatures
- canonicalization
- compression
- character recoding issues
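Why canonicalization matters for signatures, as a small Python sketch. It assumes Python 3.8 or later, where the standard library offers ET.canonicalize (C14N 2.0); the messages and the digest helper are invented for illustration:

```python
import hashlib
import xml.etree.ElementTree as ET

# Two serializations of the same message: attribute order, quoting,
# and whitespace inside the start tag differ.
MESSAGE_A = '<msg to="bob" from="alice"><body>hi</body></msg>'
MESSAGE_B = "<msg   from='alice'  to='bob' ><body>hi</body></msg>"


def digest(xml_text):
    """Hash the canonical form rather than the raw bytes."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(MESSAGE_A == MESSAGE_B)
print(digest(MESSAGE_A) == digest(MESSAGE_B))
```

The raw strings differ, but their canonical forms (and hence the digests) agree, which is what a digital signature over XML relies on.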
Public-key infrastructure
Most public-key infrastructure specs
now implemented in standard libraries. But see also:
Other programmer tools: toolkits
- 4Suite
open-source platform for XML and RDF processing
(→ Amara2)
- lxml
Python access to libxml2 and libxslt
- LT-XML
(Language Technology Group, Univ. Edinburgh) a set of tools in C,
with Python interfaces; includes sggrep
Other programmer tools: diff
File comparison (diff) tools:
- XmlDiff (part of VM Tools;
aimed at data-oriented XML only, not human-readable documents)
- 3DM XML
3-way merging and differencing tool (Tancred Lindholm)
- diffxml
(Adrian Mouat)
- DiffMk (Norm Walsh)
- XMLunit
(diff class, part of larger complex)
- JXyDiff
- diffx (Topologi)
- xmlpatch
(includes a simple xml-diff utility)
- xmldiff (by
CoreFiling; uses xmlpp pretty-printer, then system diff)
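The pretty-print-then-diff approach mentioned in the last item can be sketched in a few lines of Python. This illustrates the general technique only, not the CoreFiling tool; the function names and sample documents are invented:

```python
import difflib
from xml.dom import minidom

OLD = "<order><item>apple</item><item>pear</item></order>"
NEW = "<order><item>apple</item><item>plum</item></order>"


def pretty_lines(xml_text):
    """Normalize layout so that each element sits on its own line."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ").splitlines()


def xml_line_diff(old, new):
    """Ordinary line-based diff over the pretty-printed forms."""
    return list(difflib.unified_diff(pretty_lines(old), pretty_lines(new),
                                     "old", "new", lineterm=""))


for line in xml_line_diff(OLD, NEW):
    print(line)
```

This works acceptably for data-oriented XML; for mixed-content documents, re-indenting changes the text, which is one reason the tree-aware tools above exist.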
User-interface description languages / tools
See also XForms.
Processing (2): native-XML tools
- XSLT
- XQuery
- XPath
- XProc
- Other
XSLT
- Amara2 XSLT 1.0 + EXSLT
- Gestalt (GNU / Colin Adams)
basic-level XSLT 2.0 processor, in Eiffel
- libxslt (Gnome / Daniel Veillard)
XSLT 1.0, in C
- Saxon (Saxonica / Michael Kay)
1.0 and 2.0
- Xalan-C++ (Apache)
XSLT 1.0
- Xalan-Java (Apache)
XSLT 1.0
And do not overlook
- FXSL (Dimitre Novatchev)
library that makes XSLT 1.0 and 2.0 into fully functional languages*
XQuery (1)
Static / on-disk / indexed XQuery:
- BaseX (Univ. Konstanz)
native XML db, Java, with visual front end
- eXist (Wolfgang Meier)
very popular XML database
- MonetDB/XQuery (CWI, Amsterdam)
XQuery front end to MonetDB SQL database
- Oracle
Berkeley DB XML (sic)
- OrientX (Renmin University
of China) native XML dbms
- Rainbow
(Worcester Polytechnic) XQuery processing system using relational technology
- Sedna
(Institute for System Programming of the Russian Academy of Sciences)
native XML database in C/C++
- XQEngine
(Fatdog Software Inc.), oriented toward collections and full-text search
XQuery (2)
Dynamic / in-memory XQuery implementations:
- GCX
(DBIS, Univ. Freiburg) Garbage-collected XQuery, in-memory streaming XQuery processor
- Saxon (Michael Kay / Saxonica)
- Qexo (GNU)
partial implementation based on GNU Kawa,
compiles queries to Java byte code (or to native code)
- qizx/open
- QueryMachine.XQuery
(Semyon A. Chertkov) standalone XQuery implementation in .NET
- xbird
“light-weight” processor in Java
- XQiB (FLWOR Foundation)
XQuery in the browser “the same as JavaScript, just with
less code” (experimental)
- XQilla
- XQP (Univ. Texas at Arlington)
XQuery processing on a P2P system
XQuery (3)
Other XQuery implementations:
See also
- nux (Lawrence Berkeley Lab)
Java toolkit for XML processing (XQuery, update, full-text search ...)
- XQDT (FLWOR Foundation)
XQuery Development Tools
- XQ2XML (David Carlisle)
translations from XQuery into XML, XQueryX, XSLT, and ... XQuery
XPath
- AquaPath (Todd
Ditchendorf) XPath 2.0 evaluator (Mac OS X only)
- PsychoPath
(Eclipse) schema-aware XPath 2.0 processor (library; API, no UI)
- XpathWorkbook
Eclipse plug-in for testing XPath expressions
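For quick experiments with path expressions, a sketch in Python. Note the assumption: the standard library's ElementTree supports only a limited subset of XPath 1.0, not the XPath 2.0 of the tools above; the sample catalog is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented sample document.
DOC = """
<catalog>
  <tool kind="parser"><name>expat</name></tool>
  <tool kind="parser"><name>rxp</name></tool>
  <tool kind="formatter"><name>FOP</name></tool>
</catalog>
"""

root = ET.fromstring(DOC)

# Child steps plus an attribute-equality predicate are within the subset.
parser_names = [e.text for e in root.findall("./tool[@kind='parser']/name")]
print(parser_names)
```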
XProc processors
- Calabash
(Norman Walsh)
- Cocoatron (Todd Ditchendorf)
Mac OS X only
- Half-Pipe
(Philip Fennell) partial implementation in XSLT, atop Saxon 9
- Tubular (Herve
Quiroz)
- xmlsh (David Lee)
command line shell for XML; now includes XProc module
- xprocxq
(James Fuller) XProc in XQuery (to be integrated with eXist)
- yax (Jörg
Möbius)
XML scripting
- xmlsh (David Lee)
command line shell for XML; now includes XProc module
- xsh
and XML::XSH
(Petr Pajas) XML editing shell (and Perl access)
- virgule
(HTML/XML-based scripting language “conceptually similar to Lisp”)
Storage and retrieval
- XML and databases
- object/relational mapping
- native XML databases
- XML indexing and search
- intelligent search
XML and (SQL) databases
SQL databases with some form of XML support (see also
XQuery)
- MonetDB (CWI, Amsterdam)
SQL database with XML support (also XQuery front end)
- MySQL 5.1 and 6.0 have
(limited) support for XML*
- mysqldump
can dump database contents in
a simple (flat) XML form.
- PostgreSQL
(limited XML support)
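What a simple (flat) XML dump of a table looks like, as a Python sketch. Assumptions: SQLite stands in for the SQL database so the example is self-contained, and the row/field element names are illustrative rather than a specification of any tool's output:

```python
import sqlite3
import xml.etree.ElementTree as ET


def dump_table(conn, table):
    """Serialize every row as <row><field name="...">...</field></row>."""
    cursor = conn.execute(f"SELECT * FROM {table}")  # table name is trusted here
    columns = [d[0] for d in cursor.description]
    root = ET.Element("table_data", {"name": table})
    for record in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, record):
            field = ET.SubElement(row, "field", {"name": column})
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Invented sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tools (name TEXT, lang TEXT)")
conn.execute("INSERT INTO tools VALUES ('expat', 'C')")
print(dump_table(conn, "tools"))
```

The result is flat in the sense that it mirrors rows and columns directly; relationships between tables are not turned into nesting.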
Object/relational mapping
Mapping back and forth among objects, XML, and relational DBMS.
- Cayenne (Apache
project for object relational mapping, persistence and caching for Java;
can serialize to XML)
- Castor:
Java objects → XML → Java objects
- dbsql2xml
maps relational data into trees; can de-normalize for nicer trees
- DBIx::XML::DataLoader,
Perl module to
transfer data from an XML document into a SQL database
- XML::Generator::DBI,
Perl module for creating XML from existing DBI datasources
- Hibernate,
object/relational persistence and query service; maps between relations and objects and/or XML (in DOM4J form)
- Hyperjaxb2
combines JAXB and Hibernate, generates Hibernate mappings from XSD schemas
- XML-DBMS (Ron Bourret)
See also data binding.
Smart applications
- indexing and search
- topic maps
- ontology managers
Indexing and search
See also XML and databases, XQuery.
Topic maps
- Isidorus
(Marc Wilhelm Küster)
TM engine in Common Lisp
- LTM processor
(Ontopia / Bouvet) reads linear topic-map format, builds object structures,
exports XML
- mappa (Lars Heuer)
Python TM engine
- Onotoa (Hannes Niederhausen)
Eclipse-based ontology editor for TM
- Ontopia (Bouvet)
topic-map engine, now open-source.
- QuaaxTM (Johannes Schmidt)
TM engine with PHP interfaces
- RTM Ruby Topic Maps
(Benjamin Bock) TM engine for Ruby
- SharpTM
(Marcel Hoyer)
small TM engine for .NET
- Tiny TiM
Java TM engine with “small overhead and minimal runtime dependencies”
- TM (Robert Barta)
Perl module for topic maps reference model
- TM++
(Inge Henriksen) embedded persistent TM engine
- TM4J (Kal Ahmed?) umbrella for several
Java-based TM packages (TM engine, desktop TM navigator, graph
creation tools, integration with Web application frameworks)
- TM4Jscript
(Alexander Johannesen, Thomas Passin)
TM engine in Javascript
- TMAPIX
(Lars Heuer) Java library for TMAPI-compliant engines,
provides XPath-like queries
- Versavant (Steve Newcomb)
“Topic Map Application (TMA) Bus / Subject Addressing Engine”
following TM reference model (rather than TM data model)
- Wandora
“desktop application to build and manage topic maps” with
GUI interface
- WP2TM
(TopicObserver.com) WordPress plug-in to
turn RSS feed into XTM feed
- XTM4XMLDB
(Stefan Lischke)
Java TM engine, supports TMAPI atop any XMLDB database
- ZTM Zope Topic Maps
(Bouvet) Python-based tools for building TM-driven portals
Ontology managers (etc.)
Including semantic tools I didn't know what else to do with.
- Gnowsys (GNU)
“generic distributed network based memory/knowledge management”
- Protege
Other ‘smart application’ tools
Annotation handling:
- ANNIS2
(Univ. Potsdam) tools for manipulating and searching data in
PAULA format
- AXE
(MITH, Univ. Maryland) Ajax XML Encoder, web-based tool for
tagging text, video, etc. with XML metadata
- CATMA
Computer Aided Textual Markup and Analysis
(Univ. Hamburg)
- CWB
Corpus Workbench (IMS, Univ. Stuttgart)
“(partial) support of structural annotations (e.g. SGML)”;
central component is CQP Corpus Query Processor
- Dexter
(Boston U.)
tools for stand-off annotation
- Elan
(Max Planck Institute for Psycholinguistics)
complex annotations on video and audio resources
- EXMARaLDA
(Univ. Hamburg) esp. for discourse analysis
- GATE
General Architecture for Text Engineering
(NLP group, Univ. Sheffield)
- iNote
(IATH, Univ. Virginia)
XML annotation of images
- ITE
Interlinear Text Editor
(Michel Jacobson)
- MMAX2
multi-modal annotation
- Monk
Metadata Offer New Knowledge - “digital environment”
for studying patterns in texts
- Nite XML Toolkit
(Univ. of Edinburgh) tools for managing heavily annotated corpora
- NLTK
Natural Language Tool Kit (Steven Bird, Edward Loper, Ewan Klein et al.)
Python modules for NLP
- PACX
Platform for Annotated Corpora in XML
- Rapid Miner
(originally Univ. Dortmund, now Rapid-I) data mining toolkit
providing simple operators combinable via GUI
- SACODEYL
(System Aided Compilation and Open Distribution of European Youth Language)
tools support multi-media, transcription, annotation, and search
in TEI P5 format
- SoundIndex
(Michel Jacobson)
authoring tool for text/sound synchronization
- Transcriber
(for transcription and annotation of spoken language)
- Xaira
XML Aware Indexing and Retrieval Architecture
(Oxford University Computing Services)
generalization of Sara (British National Corpus search tool)
Output
Display, styling, web delivery (but see also above):
- XML formatters
- images, graphics
- Web application development
Publishing / formatting
- FOP (Formatting
Objects Processor)
- Scribus
desktop publishing (uses XML internally)
- xmlroff XSL formatter
(focused on DocBook)
XML and graphics
- Batik (Apache) SVG renderer
- Inkscape GUI SVG editor
- librsvg (Gnome)
component to enable software to support SVG
Content management / publishing frameworks
- Apache Forrest (“publishing
framework”, based on Cocoon)
- Apache Lenya (“Java/XML
Content Management System”, based on Cocoon; “comes with revision control, multi-site management, scheduling, search, WYSIWYG editors, and workflow”.)
- Cocoon
- Cocoon ePub Compiler
(TEI → epub-formatted e-books)
- Flux CMS (Liip AG)
- Jackrabbit (Apache)
implementation of Java Content Repository (JCR) API
- Jackalope (Liip AG)
PHP / Jackrabbit bridge
- Okapi (Liip AG and local.ch)
small framework for building web applications, built on PHP and XSLT
- Orbeon Forms (Orbeon)
more than a forms processor
- teiPublisher
- Xcruciate
an “all-XML server project” with XSLT virtual machine,
API for building apps on it,
HTTP interfaces,
and matching content-management system
Thank you
Thank you.
Questions?
Group work
- How do you use open-source software?
- Which matters more to you:
- free as in speech?
- free as in beer?
- What software do you most want / need?
- What categories are missing from this categorization?
Miscellaneous additional material
Credits
- Ali, Saqib. Document Authoring Tools.
http://www.xml-dev.com/xml/editors.html.
- Anonymous [Thomas Schmidt?].
Linguistic Annotation Wiki.
http://annotation.exmaralda.org/index.php/Linguistic_Annotation
- Anonymous. Resource Guide -> Tools -> XML Editors
http://www.xml.com/pub/rg/XML_Editors
- Anonymous [Divers hands at Bouvet ASA].
Topic Maps Snippets:
News and links from the world of Topic Maps.
http://topicmaps.bouvet.no/blog/tag/tools/
- Bourret, Ron.
XML Database Products.
http://www.rpbourret.com/xml/XMLDatabaseProds.htm
- Cover, Robin.
Public SGML/XML Software.
http://xml.coverpages.org/publicSW.html
- Garshol, Lars Marius.
Free XML Tools
http://www.garshol.priv.no/xmltools/
- Garshol, Lars Marius.
Topic Maps Tools
http://www.garshol.priv.no/tmtools/
- Graham, Tony.
Testing XSLT.
http://www.menteithconsulting.com/wiki/TestingXSLT
- Marchiori, Massimo, and Liam Quin.
[List of XQuery and XPath implementations.]
2000-2009.
http://www.w3.org/XML/Query/#implementations
- Morgan, Eric Lease. Creating and managing
XML with open source software.
http://infomotions.com/musings/xml-with-oss/.
Also in Library Hi Tech
23.4 (2005): 526-540.
- Ogbuji, Uche.
The State of Python-XML in 2004.
XML.com, 13 October 2004.
http://www.xml.com/pub/a/2004/10/13/py-xml.html
- Sall, Ken. XML and Java: The Perfect Pair: Part 1
http://www.wdvl.com/Authoring/Languages/XML/Java/
- Walsh, Norman.
xprocimpl Bookmarks.
http://delicious.com/ndw/xprocimpl
- Walsh, Norman.
[XProc] Implementations.
http://xproc.org/implementations/
- van den Broek, Thijs, and Ylva Berglund.
Choosing an XML editor.
2004, rev. 2005.
http://ahds.ac.uk/creating/information-papers/xml-editors/
And thanks for their assistance to
Robin Berjon (Robineko),
Ron Bourret (rpbourret.com),
Anthony B. Coates (Londata),
Micah Dubinko (xformsinstitute.com),
Michael Dyck,
Betty Harvey (Electronic Commerce Connection, Inc.),
Jirka Kosek,
Deborah Aleyne Lapeyre (Mulberry Technologies),
Steven R. Newcomb (Cool Heads),
Uche Ogbuji (Zepheira),
Liam Quin (W3C),
B. Tommie Usdin (Mulberry Technologies),
and
Mohamed Zergaoui (Innovimax).