Document modeling
Vocabulary design and definition
Introduction to XSD
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
Rev. 16 October 2012
Overview
- Organizational notes
- Vocabulary tour
- Specimen problems of vocabulary design
- DTDs as a meta-language
- Introduction to XSD
- Assignments for 21 October
Organizational notes
Bureaucracy, paperwork, and so on ...
Scheduling future vocabulary tours
This week: DocBook (K. Fenlon, Little)
Future:
- 23 October: DITA (Lehnen, Kapauan)
- 30 October: ...
- 6 November: ...
Scheduling minute-taking
This week: Clark
Future:
- 16 October: Clark
- 23 October: Crist
- 30 October: Bilansky
- 6 November: K. Fenlon
- 13 November: Kapauan
- 20 November: Burke
Apologies; I will get the notes of
earlier classes up very soon.
Questions?
Anything we need to clear up before proceeding?
Tour of vocabularies, cont'd
This week: DocBook
- Katrina Fenlon
- James Little
Where are we
Trying to keep some awareness of context.
You are here
Overview
Representation of individual documents
- introduction to SGML and XML
- syntax (angle brackets)
- model (trees)
- DTDs
- historical survey
Schema languages
- → DTDs
- → XSD
- Relax NG
- Schematron
Semantics
Student projects
Wrap-up
Specimen problems of vocabulary specification
The general task
What does a vocabulary definition need to do?
A first answer:
- define a set of documents*
- what's in?
- what's out?
- how do you tell the difference?
- describe how to interpret* them
- how to process them?
- how to learn from them?
Metalanguage design
What does a language for writing
vocabulary definitions need to do?
A first answer:
- make it easy to
define a set of documents* —
provide convenient tools for saying
- what's in?
- what's out?
- provide a general mechanism to
use for telling the difference
- make it easy to
describe how to interpret* them
- provide a language for saying
how to process them?
- provide a language for saying
how to learn from them?
Metalanguages for syntax
Current metalanguages for syntax mostly take three forms:
- structure-based (record structures in PLs, DBMS)
- grammar-based (BNF, EBNF, DTDs, XSD, Relax NG)
- predicate-based (SQL CHECK clauses,
Schematron, XSD assertions)
Plus combinations (of course).
Metalanguages for semantics
Current metalanguages for semantics take two forms:
- translation into some other language*
(denotational semantics, first-order logic,
description logic, RDF, ...)
- natural-language prose*
Plus combinations (of course).
What exactly does it mean to specify the
semantics of a language? (Operational semantics?
Declarative semantics?)
The syntax/semantics boundary
Many agree:
- We know fairly well how to check syntax automatically.
(formal grammars, BNF, parser generators, ...)
- We don't know at all well how to check meaning automatically.
(With exceptions for special cases.)
So many conclude:
- What we know how to check automatically is
syntax.
- What we don't know how to check automatically is
semantics.
Over time, the boundary visibly shifts.
Where we're spending our time
We'll touch upon:
- grammar-based definitions of syntax
- predicate-based constraints (Schematron)
- attempts to grapple with semantics
- techniques for documentation
- experimental and conjectural accounts
Additional requirements
In practice, our wants are more complex:
- Multiplicity: public efforts define not one language
but several* (TEI, HTML, DocBook, JATS).
- Customization: public languages often designed for local
adaptation (TEI, DocBook, JATS, DITA, ...).
- Reuse: some wish to reuse others' work (semantics? syntax?).
- Aggregation: some wish to combine multiple vocabularies.
(Think: Excel embedded in Word, Word embedded in Excel and Access.)
Of these, the last has led to a concrete mechanism:
XML namespaces.
XML Namespaces: two problems
(1) How do I distinguish
my stuff
from
everybody else's stuff?
(2) How do we uniquely identify elements,
attributes, and other things, defined by anyone
at all, so they cannot
be confused with things identified by other
people?
XML Namespaces: one solution
(1) Put your stuff in a
namespace different from
everybody else's
namespace.
(2) Use URIs. (Namespaces do not solve
this one, sorry.*)
XML Namespaces: how they work
- Every name in an XML document is conceptually
a pair:
- a namespace name (syntactically a URI)*
- a local name
- Every name in an XML document is syntactically
a pair:
- a prefix
- a local name
separated by a colon*.
- Every prefix in an XML document maps to
a namespace name.
XML Namespaces: example
In the XML, we have
<div>
<h3>The document</h3>
<p>The document we are constructing consists of a series of
logical formulas:</p>
<xf:group ref=".[count(*) = 0]">
<p><em>The document is currently empty.</em></p>
</xf:group>
In the XML, we also have
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:xf="http://www.w3.org/2002/xforms"
...
>
XML Namespaces: unpacked
So
- div in the XML expands to
{http://www.w3.org/1999/xhtml}div
- h3 in the XML expands to
{http://www.w3.org/1999/xhtml}h3
- p in the XML expands to
{http://www.w3.org/1999/xhtml}p
- xf:group in the XML expands to
{http://www.w3.org/2002/xforms}group
- em in the XML expands to
{http://www.w3.org/1999/xhtml}em
So: the XForms working group and the HTML working group do
not need to avoid each other's local names.
Because they use different namespace names,
their expanded names cannot conflict.
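The expansion can be reproduced with a namespace-aware parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which reports each element name in the same {namespace-name}local-name form used above. The sample document is a shortened, closed-off version of the fragment on the example slide.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
XFORMS = "http://www.w3.org/2002/xforms"

# Shortened version of the slide's fragment, wrapped so that it is well-formed.
doc = f"""<html xmlns="{XHTML}" xmlns:xf="{XFORMS}">
  <div>
    <h3>The document</h3>
    <xf:group ref=".[count(*) = 0]">
      <p><em>The document is currently empty.</em></p>
    </xf:group>
  </div>
</html>"""

root = ET.fromstring(doc)

# Each tag is reported as an expanded name: {namespace name}local name.
expanded = [element.tag for element in root.iter()]
for name in expanded:
    print(name)
```

Unprefixed element names pick up the default namespace declared on html; the prefix xf is only a local abbreviation and does not appear in the expanded name.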
How do we evaluate vocabularies?
We evaluate vocabulary definitions on
- how well they fit our documents
- how well we can process them
- how easy they are to learn and use
(and probably more).
How do we evaluate meta-languages?
We evaluate metalanguages on
- the quality of the vocabulary definitions
we can write with them
- the ease with which we can use them to write
good vocabulary definitions
- the solutions they provide for problems
we know we'll face
Some problems
What problems do we know we face?
- deviance from the norm
- extension
- restriction
- interoperability
- reuse and recombination
- documentation
- choice of schema language
N.B.
Not all of these have meta-language solutions!
Deviance from the norm
When 99% of the population follows simple
rules, and a few outliers don't,
how do we define the language?
E.g.
ab
ababababababab
abababab
...
abaabababbababa
Can we accept that last example as legal, while
still capturing the regularity of the others?
(Q. what is that regularity?
How do you define it?)
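One way to make the question concrete is to write the regularity down as a pattern and see what it costs to admit the outlier. This is only a sketch: the patterns STRICT and LOOSE are assumptions chosen for illustration, not the answer the slide is fishing for.

```python
import re

# Candidate statement of the regularity: one or more repetitions of "ab".
STRICT = re.compile(r"(ab)+")
# A looser rule that admits the outlier: any non-empty string of a's and b's.
LOOSE = re.compile(r"[ab]+")

samples = ["ab", "ababababababab", "abababab", "abaabababbababa"]

strict_results = {s: bool(STRICT.fullmatch(s)) for s in samples}
loose_results = {s: bool(LOOSE.fullmatch(s)) for s in samples}

for s in samples:
    print(s, strict_results[s], loose_results[s])
```

The strict pattern captures the regularity but rejects the last sample; the loose pattern accepts everything and so no longer says anything about alternation. That tension is the design problem.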
Extension
Suppose we want a vocabulary just like
[pick your favorite vocabulary here], except that
we also want to add some elements for
talking about programming-language source code:
- <code> for arbitrary bits of source code
- <ident> for identifiers
- <lit> for string literals
- <kw> for keywords
- <scrap> for larger scraps of source code
to be knit together into the program
All are phrase-level except <scrap>,
which can go either inside or between paragraphs.
Restriction
Suppose we want a vocabulary just like
[pick-favorite-here],
except that we don't want the element
<timeStamp> to be legal.
Or we like all the elements, but we
don't like the model (front?, body, back?):
we want <front> and <titlePage> to be
required.
All our documents should be valid against the
base vocabulary, but not vice versa.
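The subset requirement can be checked by brute force for small models. The sketch below assumes we model only the front/body/back content model, encode each child sequence as a string of names, and compare the sets accepted by the base model (front?, body, back?) and by a restricted model in which front is required. The names BASE, RESTRICTED and accepted are hypothetical.

```python
import re
from itertools import product

# Child sequences are encoded as names each followed by one space.
BASE = re.compile(r"(front )?body (back )?")      # (front?, body, back?)
RESTRICTED = re.compile(r"front body (back )?")   # (front, body, back?)
NAMES = ["front", "body", "back"]

def accepted(pattern, max_len=3):
    """Return every child sequence up to max_len that the model accepts."""
    result = set()
    for n in range(max_len + 1):
        for seq in product(NAMES, repeat=n):
            encoded = "".join(name + " " for name in seq)
            if pattern.fullmatch(encoded):
                result.add(seq)
    return result

base_set = accepted(BASE)
restricted_set = accepted(RESTRICTED)

# True when the restriction is a proper subset of the base language.
print(restricted_set < base_set)
```

Enumeration only works for toy models; for real schemas the subsumption test is a harder problem.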
Interoperability
Suppose we allow users of our public vocabulary
to customize it in various ways.
How can we ensure that their vocabularies
are still interoperable* in some sense?
How can we define the sense and the degree
to which they are, or are not, interoperable?
Reuse and recombination
Suppose we want to define our own vocabulary,
but we want to save effort by reusing:
- HTML's analysis of lists (<ul>, <ol>)
- TEI's phrase-level elements (<emph>,
<term>, ...)
- DocBook's elements for program listings
- SGML Open's table markup
- ...
How do we define a language to make
this easy to do?
Documentation
How do you document a vocabulary?
- tutorial / introductory documentation
- reference documentation
- examples
- edit-time hints?
- processing requirements? expectations?
DTDs as a meta-language
In which we attempt to take stock
of DTDs as a schema language.
Maintainability of DTDs
Production DTDs use parameter entities widely
for maintainability. (Examples from TEI P3 DTD.)
- inclusion of external files for ‘modularity’
(tei2.dtd, teipros2.dtd,
teidict2.dtd, ...)
- definition of content models and fragments
(%paraContent;,
%phrase.seq;)
- definition of element classes
(%m.hqphrase;,
%m.date;,
%m.seg;,
%m.bibl;,
%m.phrase;,
%m.lists;,
...)
- definition of common attributes
(%a.global;)
- definition of pseudo-types (%ISO-date;)
Other uses of PEs:
- attribute defaults
- enumerated types (for attributes)
Extensibility of DTDs
Production DTDs use parameter entities widely
to allow extension by users. (Examples from TEI P3 DTD.)
- inclusion of external extension files
(%TEI.extensions.ent;,
%TEI.extensions.dtd;)
- extension of element classes
(%x.hqphrase;,
%x.date;,
%x.seg;,
...)
- renaming of elements
(%n.TEI2;,
%n.teiHeader;, ...)
- conditional inclusion of declarations
(%TEI.prose;,
%TEI.verse;, etc., all
have values IGNORE or INCLUDE;
conditional sections control inclusion)
DTDs and BNF
Any DTD can be translated into a context-free grammar / BNF.
<!ELEMENT x %x.model; >
⇒
x ::= start_x x.model end_x
start_x ::= "<x>"
x.model ::= ... // translation of content model
end_x ::= "</x>"
One element declaration, one production rule.
Attributes handled separately, not in grammar.
Consequence: validation is (conceptually) easy.
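The translation scheme above is mechanical enough to sketch in a few lines. The function element_to_bnf is hypothetical and, as an assumption to keep it short, handles only flat content models without nested groups.

```python
def element_to_bnf(name, model):
    """Return the four production rules for one element declaration.

    `model` is a flat DTD content model such as "(aufgesang, abgesang)".
    Nested groups are not handled in this sketch.
    """
    # Drop the outer parentheses and turn the sequence connector "," into
    # plain juxtaposition, as in BNF.
    body = " ".join(model.strip("() ").replace(",", " ").split())
    return [
        f"{name} ::= start_{name} {name}.model end_{name}",
        f'start_{name} ::= "<{name}>"',
        f"{name}.model ::= {body}",
        f'end_{name} ::= "</{name}>"',
    ]

rules = element_to_bnf("canzone", "(aufgesang, abgesang)")
print("\n".join(rules))
```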
DTDs and BNF, qualifications
SGML/XML DTDs resemble Backus-Naur Form grammars, but:
- They describe bracketed languages* ...
- ... so ‘non-terminals’ are visible*.
- SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete
parsing problem for non-bracketed L).
- They are not purely grammatical (notations, entities).
- Determinism rule.
- Elements and productions not necessarily 1 : 1.
A document grammar
Limericks and canzone:
poem ::= limerick | canzone
limerick ::= trimeter trimeter dimeter
dimeter trimeter
trimeter ::= CHAR+
dimeter ::= CHAR+
canzone ::= aufgesang abgesang
aufgesang ::= stollen stollen // aka piedi
stollen ::= line+
abgesang ::= line+ // aka cauda, sirima
line ::= CHAR+
A DTD
Limericks and canzone:
<!ELEMENT poem (limerick | canzone) >
<!ELEMENT limerick (trimeter, trimeter,
dimeter, dimeter,
trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter (#PCDATA)>
<!ELEMENT canzone (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen (l+) >
<!ELEMENT abgesang (l+) >
<!ELEMENT l (#PCDATA) >
A limerick
<poem>
<limerick>
<trimeter>
There was a young lady named Bright
</trimeter>
<trimeter>
whose speed was much faster than light.
</trimeter>
<dimeter>She set out one day,</dimeter>
<dimeter>in a relative way,</dimeter>
<trimeter>
and returned on the previous night.
</trimeter>
</limerick>
</poem>
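Because each content model is a regular expression over child element names, validating an instance like this one comes down to one regular-expression match per element. The following is a minimal sketch of that idea, not a real DTD validator: the table CONTENT_MODELS is a hand translation of the DTD above, and the sketch ignores attributes and stray character data in element-only content.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the DTD: child names are each followed by one space.
CONTENT_MODELS = {
    "poem": r"(limerick |canzone )",
    "limerick": r"trimeter trimeter dimeter dimeter trimeter ",
    "trimeter": r"",   # (#PCDATA): no child elements
    "dimeter": r"",
    "canzone": r"aufgesang abgesang ",
    "aufgesang": r"stollen stollen ",
    "stollen": r"(l )+",
    "abgesang": r"(l )+",
    "l": r"",
}

def is_valid(element):
    """Check an element and its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:          # undeclared element
        return False
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in element)

good = ET.fromstring(
    "<poem><limerick><trimeter>a</trimeter><trimeter>b</trimeter>"
    "<dimeter>c</dimeter><dimeter>d</dimeter><trimeter>e</trimeter>"
    "</limerick></poem>"
)
bad = ET.fromstring(
    "<poem><limerick><trimeter>a</trimeter><dimeter>c</dimeter>"
    "</limerick></poem>"
)
print(is_valid(good), is_valid(bad))
```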
A canzone
<poem>
<canzone>
<aufgesang>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
</aufgesang>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
</poem>
Note on the canzone DTD
- All the non-terminals show up as tags
(e.g. <poem>)
- The two Stollen must have same number of
lines; this rule is not expressed.
- The Abgesang must have more lines than a
Stollen, fewer than Aufgesang; this rule is not expressed.
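Rules like these, which the grammar cannot express, can still be checked by a predicate over the parsed document. The sketch below assumes the element names of the DTD above; the function canzone_line_rules_hold is hypothetical and checks only the two line-count rules just mentioned.

```python
import xml.etree.ElementTree as ET

def canzone_line_rules_hold(canzone):
    """Check the two line-count rules the DTD leaves unexpressed."""
    stollen = [len(s.findall("l")) for s in canzone.findall("aufgesang/stollen")]
    abgesang = len(canzone.findall("abgesang/l"))
    aufgesang = sum(stollen)
    # Rule 1: the two Stollen have the same number of lines.
    same_length = len(stollen) == 2 and stollen[0] == stollen[1]
    # Rule 2: the Abgesang is longer than a Stollen, shorter than the Aufgesang.
    return same_length and stollen[0] < abgesang < aufgesang

sample = ET.fromstring(
    "<canzone><aufgesang>"
    "<stollen><l>1</l><l>2</l></stollen>"
    "<stollen><l>3</l><l>4</l></stollen>"
    "</aufgesang><abgesang><l>5</l><l>6</l><l>7</l></abgesang></canzone>"
)
print(canzone_line_rules_hold(sample))
```

This is the predicate-based style of constraint mentioned earlier (Schematron, XSD assertions), written out by hand.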
Removing non-terminals
<!ENTITY % poem "(limerick | canzone)" >
<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines "l+" >
<!ELEMENT canzone (%aufgesang;, abgesang) >
<!ELEMENT stollen (%lines;) >
<!ELEMENT abgesang (%lines;) >
<!ELEMENT l (#PCDATA) >
The canzone minus explicit Aufgesang
<canzone>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
N.B. element : non-terminal
no longer 1:1.
The canzone minus NTs
<canzone>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</canzone>
Removing all non-terminals
<!ENTITY % stollen "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang "l+" >
<!ELEMENT canzone (%aufgesang;, %abgesang;) >
<!ELEMENT l (#PCDATA) >
ERROR: this DTD is illegal; why?
Advantages of DTDs
- clarity
- simplicity
- well developed techniques for
maintainability, extensibility
Shortcomings of DTDs
Among the shortcomings often noted:
- ad hoc syntax (require own parser)
- namespace support very weak
- selection of attribute datatypes eccentric
- attribute datatypes weak (no user-defined patterns, enumerations only)
- no datatypes for element content
- no way to say explicitly how elements
are related (specializations, generalizations,
same-structure)
- determinism rule
(need review? see slides of 2
October)
- logical (element) structure
and physical (entity) structure
independent; why are they mixed
in the same metalanguage?
Introduction to XSD
What's XSD?
XML Schema Definition Language
aka “XSDL”,
“XML Schema”,
“W3C XML Schema”,
“WXS”
(also “that @#$@%#$**&!! language”)
XSD 1.0 first draft 1999, W3C Recommendation 2001.
XSD 1.1 2012.
A product of struggle:
data-heads vs doc-heads.
Key properties of XSD
- DTD++, DTD--
- instance syntax
- supporting programming-language and database-oriented
types
- design problems
XSD and DTDs
Ways XSD resembles DTDs:
- grammar-based
- same element : production rule relation
- determinism rule
- attempts to replicate all DTD functionality
(except entity declarations)*
* N.B. omission of entity declarations a
technical / political issue, not a value
judgement.
DTD--
DTD constructs XSD omits:
- entity declarations
- conditional inclusion of blocks
DTD++
Some ways XSD goes beyond DTDs:
- rational support for namespaces
- larger datatype system*
- explicit relations among elements
(substitutability)
- explicit types for elements
- explicit relations among types
(derivation by restriction, extension)**
- numeric occurrence indicators
- assertions*
- conditional type assignment**
* discussed later
** not discussed; you're on your own
The canzone schema v.1
In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a
schema element:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify
a document-root element / start symbol.
Declaring elements
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Declaring elements
- Note difference between element declaration (outer)
and element reference (inner).
- Implicit occurrence information: min = max = 1.
Positive closure
<xsd:element name="abgesang">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="stollen">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Character data
<xsd:element name="l">
<xsd:complexType mixed="true">
<xsd:sequence>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
or
<xsd:element name="l" type="xsd:string"/>
Supporting programming-language and dbms paradigms
- tag/type distinction
- named and anonymous datatypes
- simple datatypes
The tag/type distinction
Let's generalize from
<xsd:element name="l" type="xsd:string"/>
Can we do that for every element type?
N.B. four kinds of type:
- element type (vs. element, element instance)
- data type
- simple type (lexical form has no markup)
- complex type (has element children)
In the example,
l may be thought of as an
accessor.
Top-level named complex types
Named types can be used to capture commonalities:
<xsd:complexType name="lines">
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
<xsd:element name="abgesang" type="lines"/>
<xsd:element name="stollen" type="lines"/>
Top-level complex types
... or just to provide a name for a type:
<xsd:complexType name="canzoneform">
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="aufgesang">
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
<xsd:element name="canzone" type="canzoneform"/>
<xsd:element name="aufgesang" type="aufgesang"/>
Anonymous types
We can hide things using
anonymous local types:
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="stollen" type="lines"/>
<xsd:element name="stollen" type="lines"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="abgesang" type="lines"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Note nested declarations and definitions.
Inheritance: type derivation
It turns out to be hard to model stepwise refinement
of types:
- restriction (preserves subset semantics)
- extension (preserves prefix semantics)
Inheritance in document systems
Document systems turn out to have a very different model
of class systems and inheritance.
- inheritance of attributes
- inheritance of locations
Design problems and research questions
Schemas and namespaces
Some (unpleasant) facts of life:
- Namespaces allow us to distinguish mine
from not-mine.
- Namespaces do not provide universal names.
- The namespace : language
relation is 1:n.
- The language : grammar
relation is 1:n.
- Therefore, the namespace : schema
relation is 1:n.
Live with it.
We distinguish:
- schema documents (with single target namespace)
- schemas (sets of abstract components)
Schema composition operations:
- import
- include
- include with override / redefine
Modularization
XML Schema makes it possible to write modular document
type definitions:
- late collection of schema components
- namespace-aware name matching, validation
- white-box wildcards (lax / opportunistic)
- black-box wildcards (skip)
Linking document and schema
- namespace name
- schemaLocation hint
Post-schema-validation infoset (PSVI)
XML-Schema validation: infoset → infoset.
- additions, no changes
- type assignment information
- validation-attempted information (strict, lax, skip)
- validation-outcome information
Non-local effects
Consider the HTML
input element:
- legal only in p and similar elements
- legal only within form elements
SGML DTDs have partial solutions:
- inclusion exceptions
- content models
Non-local effects in XML Schema
Fundamentally, we trade verbosity for context-sensitivity:
<xsd:element name="div" type="div-type"/>
<xsd:element name="div" type="div-in-form-type"/>
<xsd:element name="p" type="p-type"/>
<xsd:element name="p" type="p-in-form-type"/>
<xsd:element name="ul" type="ul-type"/>
<xsd:element name="ul" type="ul-in-form-type"/>
<xsd:element name="li" type="li-type"/>
<xsd:element name="li" type="li-in-form-type"/>
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of
context sensitivity).
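The doubling can be made concrete by generating the type names. This is a sketch under stated assumptions: the function contextual_type_names is hypothetical, and the naming scheme simply follows the div-type / div-in-form-type pattern shown above.

```python
from itertools import product

def contextual_type_names(elements, context_bits):
    """List one type name per element per combination of context bits."""
    names = []
    for element in elements:
        for flags in product([False, True], repeat=len(context_bits)):
            suffix = "".join(
                f"-in-{bit}" for bit, on in zip(context_bits, flags) if on
            )
            names.append(f"{element}{suffix}-type")
    return names

one_bit = contextual_type_names(["div", "p", "ul", "li"], ["form"])
two_bits = contextual_type_names(["div", "p", "ul", "li"], ["form", "table"])
print(len(one_bit), len(two_bits))
```

Each additional bit of context multiplies the number of types by two, so n bits cost a factor of 2 to the power n. The second context bit, "table", is invented here purely to show the growth.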
The determinism rule remains controversial:
- LL(1) guarantees may help implementors
- All regular languages have a deterministic FSA;
- ... but not necessarily a deterministic
regular expression!
- Implications for closure under union, intersection.
- Implications for subsumption tests.
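For readers who want to experiment with the rule, here is a minimal sketch of a determinism check based on the usual position (Glushkov) construction: a content model is treated as deterministic when no two distinct positions with the same element name compete in the first set or in any follow set. The tuple encoding of content models and the function names are assumptions made for this example.

```python
from itertools import count

def analyse(node, counter, symbols, follow):
    """Return (nullable, first, last) for a content-model node.

    Nodes are tuples: ("sym", name), ("seq", a, b), ("alt", a, b),
    ("plus", a), ("star", a), ("opt", a).
    """
    kind = node[0]
    if kind == "sym":
        pos = next(counter)            # each occurrence gets its own position
        symbols[pos] = node[1]
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("seq", "alt"):
        n1, f1, l1 = analyse(node[1], counter, symbols, follow)
        n2, f2, l2 = analyse(node[2], counter, symbols, follow)
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:                   # end of the left part leads into the right
            follow[p] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind not in ("plus", "star", "opt"):
        raise ValueError(f"unknown node kind: {kind}")
    n, f, l = analyse(node[1], counter, symbols, follow)
    if kind in ("plus", "star"):
        for p in l:                    # repetition loops back to the start
            follow[p] |= f
    return (n if kind == "plus" else True), f, l

def is_deterministic(model):
    symbols, follow = {}, {}
    _, first, _ = analyse(model, count(1), symbols, follow)

    def clash(positions):
        names = [symbols[p] for p in positions]
        return len(names) != len(set(names))

    return not clash(first) and not any(clash(f) for f in follow.values())

LINES = ("plus", ("sym", "l"))
print(is_deterministic(("seq", ("sym", "aufgesang"), ("sym", "abgesang"))))
print(is_deterministic(("seq", ("seq", LINES, LINES), LINES)))
```

The second model corresponds to (l+, l+, l+): after reading one l, a parser cannot tell without lookahead whether the next l continues the first group or starts the second, so the check reports it as non-deterministic even though the language itself (three or more l elements) is regular.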
Constructs to mention
Other constructs we will discuss if there
is time:
- abstract elements
- named model groups
Assignments
Due: Sunday morning 21 October 2012.
Use declarations in the internal subset of a DTD
(the part of the DTD internal to the XML document instance) to modify
the XML DTD of ISO 8879 Annex E by adding <emph>
and <term> elements as phrase-level elements (to occur
wherever <hp1> can occur).
Optionally, add further elements at
phrase-level or other levels, and document them
in comments.
The next two assignments relate to the vocabulary described
below.
Write a DTD for the vocabulary described.
Write an XSD schema document for the vocabulary
described.
If any constraints expressed in the prose are unenforceable in either
DTD or XSD notation, write the schema either to
overgenerate (i.e. to accept all good documents and fail to reject
some bad ones), or else to undergenerate (i.e. to reject
all bad ones, at the cost of failing to accept some good ones).
Assignments, background
We defined a vocabulary consisting of the following elements,
which obey the constraints indicated.
- <doc> is the outermost element; it contains
a title, an optional copyright statement, a sequence of
zero or more paragraph-level elements, and
a sequence of zero or more sections (<sec> elements),
in that order.
- <title> is the document title; it contains
text (i.e. data characters) with phrase-level markup.
- <copyright> is the copyright statement; it contains
a sequence of one or more paragraphs.
- <sec> is a section; it contains
a title, a sequence of
zero or more paragraph-level elements, and
a sequence of zero or more sections (<sec> elements),
in that order.
(continued ...)
Assignments, background (cont'd)
Some elements are described here as being
paragraph-level
elements; that means they can occur at the same level
as paragraphs, in sections and so on. (Sometimes we say
“they occur between paragraphs”.)
- <p> is a paragraph. It can always contain
text and phrase-level markup. As a child of <doc> it can
also contain notes and lists; as a child of <note> it
can contain lists, but not notes.
- <note> is a note. It contains a sequence
of paragraph-level elements.
- <list> is an itemized list. It contains a sequence
of <item> elements.
- <item> is a list item. You may choose whether
it contains a sequence of paragraphs,
or a sequence of paragraph-level elements.
(N.B. <item> is not a paragraph-level element;
it's listed here to be close to <list>.)
(continued ...)
Assignments, background (cont'd)
The phrase-level elements are:
- <emph> (for rhetorical emphasis)
- <term> (for technical terms)
- <cited> (for titles of books and articles cited)
- <ital> (for italics not otherwise accounted for)
- <bold> (for bold not otherwise accounted for)
All phrase-level elements can contain character data and
phrase-level elements.