Document modeling

Vocabulary design and definition

Introduction to XSD

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

Rev. 16 October 2012

Overview

Organizational notes
Vocabulary tour
Specimen problems of vocabulary design
DTDs as a meta-language
Introduction to XSD
Assignments for 21 October

Organizational notes

Bureaucracy, paperwork, and so on ...

Scheduling future vocabulary tours

This week: Docbook (K. Fenlon, Little)

Future:

23 October: DITA (Lehnen, Kapauan)
30 October: ...
7 November: ...

Scheduling minute-taking

This week: Clark

Future:

16 October: Clark
23 October: Crist
30 October: Bilansky
6 November: K. Fenlon
13 November: Kapauan
20 November: Burke

Apologies; I will get the notes of earlier classes up very soon.

Questions?

Anything we need to clear up before proceeding?

Tour of vocabularies, cont'd

This week: Docbook

Katrina Fenlon
James Little

Where are we

Trying to keep some awareness of context.

Specimen problems of vocabulary specification

The general task

What does a vocabulary definition need to do?

A first answer:

define a set of documents
- what's in?
- what's out?
- how do you tell the difference?
describe how to interpret them
- how to process them?
- how to learn from them?

Metalanguage design

What does a language for writing vocabulary definitions need to do?

A first answer:

make it easy to define a set of documents
- provide convenient tools for saying what's in and what's out
- provide a general mechanism for telling the difference
make it easy to describe how to interpret them
- provide a language for saying how to process them
- provide a language for saying how to learn from them

Metalanguages for syntax

Current metalanguages for syntax mostly take three forms:

structure-based (record structures in PLs, DBMS)
grammar-based (BNF, EBNF, DTDs, XSD, Relax NG)
predicate-based (SQL CHECK clauses, Schematron, XSD assertions)

Plus combinations (of course).

Metalanguages for semantics

Current metalanguages for semantics take two forms:

translation into some other language (denotational semantics, first-order logic, description logic, RDF, ...)
natural-language prose

Plus combinations (of course).

What exactly does it mean to specify the semantics of a language? (Operational semantics? Declarative semantics?)

The syntax/semantics boundary

Many agree:

We know fairly well how to check syntax automatically. (formal grammars, BNF, parser generators, ...)
We don't know at all well how to check meaning automatically. (With exceptions for special cases.)

So many conclude:

What we know how to check automatically is syntax.
What we don't know how to check automatically is semantics.

Over time, the boundary visibly shifts.

Where we're spending our time

We'll touch upon:

grammar-based definitions of syntax
- DTDs
- XSD
- Relax NG
predicate-based constraints (Schematron)
attempts to grapple with semantics
- techniques for documentation
- experimental and conjectural accounts

Additional requirements

In practice, our wants are more complex:

Multiplicity: public efforts define not one language but several (TEI, HTML, DocBook, JATS).
Customization: public languages often designed for local adaptation (TEI, DocBook, JATS, DITA, ...).
Reuse: some wish to reuse others' work (semantics? syntax?).
Aggregation: some wish to combine multiple vocabularies. (Think: Excel embedded in Word, Word embedded in Excel and Access.)

Of these, the last has led to a concrete mechanism: XML namespaces.

XML Namespaces: two problems

(1) How do I distinguish my stuff from everybody else's stuff?

(2) How do we uniquely identify elements, attributes, and other things, defined by anyone at all, so they cannot be confused with things identified by other people?

XML Namespaces: one solution

(1) Put your stuff in a namespace different from everybody else's namespace.

(2) Use URIs. (Namespaces do not solve this one, sorry.)

XML Namespaces: how they work

Every name in an XML document is conceptually a pair:
- a namespace name (syntactically a URI)
- a local name
Every name in an XML document is syntactically a pair, separated by a colon:
- a prefix (possibly empty, in which case the colon is omitted; unprefixed element names take the default namespace)
- a local name
Every prefix in an XML document maps to a namespace name.

XML Namespaces: example

Consider this XForms document.

In the XML, we have

<div>
  <h3>The document</h3>
  <p>The document we are constructing consists of a series of
  logical formulas:</p>
  <xf:group ref=".[count(*) = 0]">
    <p><em>The document is currently empty.</em></p>
  </xf:group>

In the XML, we also have

<html xmlns="http://www.w3.org/1999/xhtml" 
      xmlns:xf="http://www.w3.org/2002/xforms"
      ...
      >

XML Namespaces: unpacked

div in the XML expands to {http://www.w3.org/1999/xhtml}div
h3 in the XML expands to {http://www.w3.org/1999/xhtml}h3
p in the XML expands to {http://www.w3.org/1999/xhtml}p
xf:group in the XML expands to {http://www.w3.org/2002/xforms}group
em in the XML expands to {http://www.w3.org/1999/xhtml}em

So: the XForms Working Group and the HTML Working Group do not need to avoid each other's local names. Because they use different namespace names, their expanded names cannot conflict.
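The expansion above can be reproduced mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree, which happens to store element names in the same {namespace-name}local-name notation; the document string is a cut-down, closed-off version of the XForms example, not the original file.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the example: XHTML is the default namespace,
# and the prefix "xf" is bound to the XForms namespace.
DOC = """\
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xf="http://www.w3.org/2002/xforms">
  <div>
    <h3>The document</h3>
    <xf:group ref=".[count(*) = 0]">
      <p><em>The document is currently empty.</em></p>
    </xf:group>
  </div>
</html>
"""

root = ET.fromstring(DOC)

# ElementTree keeps each element name in expanded form,
# i.e. "{namespace-name}local-name"; prefixes are not retained.
expanded_names = [elem.tag for elem in root.iter()]

for name in expanded_names:
    print(name)
```

Note that the prefix itself plays no part in the result: rebinding xf to another prefix in the document would yield the same expanded names.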

How do we evaluate vocabularies?

We evaluate vocabulary definitions on

how well they fit our documents
how well we can process them
how easy they are to learn and use

(and probably more).

How do we evaluate meta-languages?

We evaluate metalanguages on

the quality of the vocabulary definitions we can write with them
the ease with which we can use them to write good vocabulary definitions
the solutions they provide for problems we know we'll face

Some problems

What problems do we know we face?

deviance from the norm
extension
restriction
interoperability
reuse and recombination
documentation
choice of schema language

N.B. Not all of these have meta-language solutions!

Deviance from the norm

When 99% of the population follows simple rules, and a few outliers don't, how do we define the language?

E.g.

ababababababab

abababab

...

abaabababbababa

Can we accept that last example as legal, while still capturing the regularity of the others?

(Q. what is that regularity? How do you define it?)
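One possible reading of the regularity, offered here as an assumption and not as the intended answer, is "one or more repetitions of ab". The sketch below tests the three strings against that strict definition and against a loose one (any string over a and b); the loose one accepts the outlier but no longer captures the pattern.

```python
import re

# Assumption: the regularity is "ab repeated one or more times".
STRICT = re.compile(r"(?:ab)+")
# A loose definition that admits the outlier: any non-empty string of a's and b's.
LOOSE = re.compile(r"[ab]+")

samples = ["ababababababab", "abababab", "abaabababbababa"]

strict_results = {s: bool(STRICT.fullmatch(s)) for s in samples}
loose_results = {s: bool(LOOSE.fullmatch(s)) for s in samples}

for s in samples:
    print(s, "strict:", strict_results[s], "loose:", loose_results[s])
```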

Extension

Suppose we want a vocabulary just like [pick your favorite vocabulary here], except that we also want to add some elements for talking about programming-language source code:

<code> for arbitrary bits of source code
<ident> for identifiers
<lit> for string literals
<kw> for keywords
<scrap> for larger scraps of source code to be knit together into the program

All are phrase-level except <scrap>, which can go either inside or between paragraphs.

Restriction

Suppose we want a vocabulary just like [pick-favorite-here], except that we don't want the element <timeStamp> to be legal.

Or we like all the elements, but we don't like the model (front?, body, back?): we want <front> and <titlePage> to be required.

All our documents should be valid against the base vocabulary, but not vice versa.

Interoperability

Suppose we allow users of our public vocabulary to customize it in various ways.

How can we ensure that their vocabularies are still interoperable in some sense?

How can we define the sense and the degree to which they are, or are not, interoperable?

Reuse and recombination

Suppose we want to define our own vocabulary, but we want to save effort by reusing:

HTML's analysis of lists (<ul>, <ol>)
TEI's phrase-level elements (<emph>, <term>, ...)
DocBook's elements for program listings
SGML Open's table markup
...

How do we define a language to make this easy to do?

Documentation

How do you document a vocabulary?

tutorial / introductory documentation
reference documentation
examples
edit-time hints?
processing requirements? expectations?

DTDs as a meta-language

In which we attempt to take stock of DTDs as a schema language.

Maintainability of DTDs

Production DTDs use parameter entities widely for maintainability. (Examples from TEI P3 DTD.)

inclusion of external files for ‘modularity’ (tei2.dtd, teipros2.dtd, teidict2.dtd, ...)
definition of content models and fragments (%paraContent;, %phrase.seq;)
definition of element classes (%m.hqphrase;, %m.date;, %m.seg;, %m.bibl;, %m.phrase;, %m.lists;, ...)
definition of common attributes (%a.global;)
definition of pseudo-types (%ISO-date;)

Other uses of PEs:

attribute defaults
enumerated types (for attributes)

Extensibility of DTDs

Production DTDs use parameter entities widely to allow extension by users. (Examples from TEI P3 DTD.)

inclusion of external extension files (%TEI.extensions.ent;, %TEI.extensions.dtd;)
extension of element classes (%x.hqphrase;, %x.date;, %x.seg;, ...)
renaming of elements (%n.TEI2;, %n.teiHeader;, ...)
conditional inclusion of declarations (%TEI.prose;, %TEI.verse;, etc., all have values IGNORE or INCLUDE; conditional sections control inclusion)

DTDs and BNF

Any DTD can be translated into a context-free grammar / BNF.

<!ELEMENT x %x.model; >

⇒

x ::= start_x x.model end_x
start_x ::= "<x>" 
x.model ::= ... // translation of content model
end_x ::= "</x>"

One element declaration, one production rule.

Attributes handled separately, not in grammar.

Consequence: validation is (conceptually) easy.
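The one-declaration, one-rule translation can be sketched in a few lines. This is an illustration under simplifying assumptions: the function name element_to_bnf is invented here, the content model is copied across unchanged instead of being translated, and attributes and parameter entities are ignored.

```python
import re

# Matches a single element declaration: <!ELEMENT name model >
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)\s*>", re.S)

def element_to_bnf(decl: str) -> list[str]:
    """Translate one element declaration into four BNF-style productions."""
    match = ELEMENT_DECL.fullmatch(decl.strip())
    if match is None:
        raise ValueError(f"not an element declaration: {decl!r}")
    name, model = match.group(1), match.group(2)
    return [
        f"{name} ::= start_{name} {name}.model end_{name}",
        f'start_{name} ::= "<{name}>"',
        # The content model is left in DTD notation (not translated further).
        f"{name}.model ::= {model}",
        f'end_{name} ::= "</{name}>"',
    ]

productions = element_to_bnf("<!ELEMENT stollen (l+) >")
for line in productions:
    print(line)
```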

DTDs and BNF, qualifications

SGML/XML DTDs resemble Backus-Naur Form grammars, but:

They describe bracketed languages ...
... so ‘non-terminals’ are visible.
SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete parsing problem for non-bracketed L).
They are not purely grammatical (notations, entities).
Determinism rule.
Elements and productions not necessarily 1 : 1.

A document grammar

Limericks and canzone:

poem     ::= limerick | canzone

limerick ::= trimeter trimeter dimeter 
             dimeter trimeter
trimeter ::= CHAR+
dimeter  ::= CHAR+

canzone   ::= aufgesang abgesang
aufgesang ::= stollen stollen // aka piedi
stollen   ::= line+
abgesang  ::= line+ // aka cauda, sirima

A DTD

Limericks and canzone:

<!ELEMENT poem (limerick | canzone) >

<!ELEMENT limerick (trimeter, trimeter, 
                    dimeter, dimeter, 
                    trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter  (#PCDATA)>

<!ELEMENT canzone   (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen   (l+) >
<!ELEMENT abgesang  (l+) >
<!ELEMENT l         (#PCDATA) >
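Because each content model is a regular expression over child element names, checking a document against this DTD can be sketched with ordinary regular expressions. The following is a hand-translated illustration, not a DTD processor: the table CONTENT_MODELS and the function is_valid are written for this example only, and the check ignores attributes and stray text between child elements.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the poem DTD, rewritten by hand as regular expressions
# over child element names (each name is followed by one space).
CONTENT_MODELS = {
    "poem": r"(limerick |canzone )",
    "limerick": r"trimeter trimeter dimeter dimeter trimeter ",
    "canzone": r"aufgesang abgesang ",
    "aufgesang": r"stollen stollen ",
    "stollen": r"(l )+",
    "abgesang": r"(l )+",
}
# Elements declared as (#PCDATA): no element children allowed.
PCDATA_ONLY = {"trimeter", "dimeter", "l"}

def is_valid(elem) -> bool:
    """Check elem and its descendants against the hand-translated models."""
    children = list(elem)
    if elem.tag in PCDATA_ONLY:
        return not children
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False  # undeclared element
    child_names = "".join(child.tag + " " for child in children)
    if re.fullmatch(model, child_names) is None:
        return False
    return all(is_valid(child) for child in children)

good = ET.fromstring(
    "<poem><limerick>"
    "<trimeter>There was a young lady named Bright</trimeter>"
    "<trimeter>whose speed was much faster than light.</trimeter>"
    "<dimeter>She set out one day,</dimeter>"
    "<dimeter>in a relative way,</dimeter>"
    "<trimeter>and returned on the previous night.</trimeter>"
    "</limerick></poem>"
)
# A limerick with only one line does not match the content model.
bad = ET.fromstring("<poem><limerick><trimeter>Too short</trimeter></limerick></poem>")

print(is_valid(good), is_valid(bad))
```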

A limerick

<poem>
  <limerick>
    <trimeter>
      There was a young lady named Bright
    </trimeter>
    <trimeter>
      whose speed was much faster than light.
    </trimeter>
    <dimeter>She set out one day,</dimeter>
    <dimeter>in a relative way,</dimeter>
    <trimeter>
      and returned on the previous night.
    </trimeter>
  </limerick>
</poem>

A canzone

<poem>
  <canzone>
    <aufgesang>
      <stollen>
        <l>unter den linden an der heide</l>
        <l>da unser zweier bette was</l>
      </stollen>
      <stollen>
        <l>da mugt ir vinden schone beide</l>
        <l>gebrochen bluomen unde gras</l>
      </stollen>
    </aufgesang>
    <abgesang>
      <l>kuste er mich? wol tusentstunt</l>
      <l>tandaradei</l>
      <l>seht wie rot mir ist der munt</l>
    </abgesang>
  </canzone>
</poem>

Note on the canzone DTD

All the non-terminals show up as tags (e.g. <poem>)
The two Stollen must have same number of lines; this rule is not expressed.
The Abgesang must have more lines than a Stollen, fewer than Aufgesang; this rule is not expressed.
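The two unexpressed rules are easy to state as predicates over the parsed document, which is roughly what a predicate-based layer (Schematron, XSD assertions) would add. The sketch below checks them in plain Python; the function name check_line_counts and the wording of the messages are inventions for this example.

```python
import xml.etree.ElementTree as ET

CANZONE = """\
<canzone>
  <aufgesang>
    <stollen>
      <l>unter den linden an der heide</l>
      <l>da unser zweier bette was</l>
    </stollen>
    <stollen>
      <l>da mugt ir vinden schone beide</l>
      <l>gebrochen bluomen unde gras</l>
    </stollen>
  </aufgesang>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>
"""

def check_line_counts(canzone) -> list[str]:
    """Return the violated line-count rules (empty list if none)."""
    problems = []
    stollen_lengths = [len(s.findall("l"))
                       for s in canzone.findall("./aufgesang/stollen")]
    aufgesang_length = sum(stollen_lengths)
    abgesang_length = len(canzone.findall("./abgesang/l"))
    # Rule 1: the two Stollen must have the same number of lines.
    if len(set(stollen_lengths)) > 1:
        problems.append("Stollen differ in length")
    # Rule 2: longer than a Stollen, shorter than the Aufgesang.
    if stollen_lengths and not (
        max(stollen_lengths) < abgesang_length < aufgesang_length
    ):
        problems.append("Abgesang length out of range")
    return problems

problems = check_line_counts(ET.fromstring(CANZONE))
print(problems)
```

For the sample canzone the Stollen have two lines each, the Aufgesang four, and the Abgesang three, so both rules hold.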

Removing non-terminals

<!ENTITY % poem     "(limerick | canzone)" >
<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines     "l+" >
<!ELEMENT canzone   (%aufgesang;, abgesang) >
<!ELEMENT stollen   (%lines;) >
<!ELEMENT abgesang  (%lines;) >
<!ELEMENT l         (#PCDATA) >
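To see what the parser sees after these declarations, it can help to expand the parameter-entity references by hand. The following is a rough sketch of that textual substitution; the function expand is written for this illustration, and it omits details of real processors (which, for instance, pad the replacement text with a space on each side when a reference occurs in a declaration).

```python
import re

# The parameter entities used in the content models above.
PARAMETER_ENTITIES = {
    "aufgesang": "stollen, stollen",
    "lines": "l+",
}

PE_REFERENCE = re.compile(r"%([A-Za-z_][\w.-]*);")

def expand(text: str, entities: dict[str, str]) -> str:
    """Replace each %name; reference by its (recursively expanded) value."""
    def replace(match):
        return expand(entities[match.group(1)], entities)
    return PE_REFERENCE.sub(replace, text)

canzone_model = expand("(%aufgesang;, abgesang)", PARAMETER_ENTITIES)
stollen_model = expand("(%lines;)", PARAMETER_ENTITIES)

print(canzone_model)
print(stollen_model)
```

After expansion, canzone contains stollen elements directly: the Aufgesang survives in the DTD source as a name but no longer appears as a tag in the document.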

The canzone minus explicit Aufgesang

<canzone>
  <stollen>
    <l>unter den linden an der heide</l>
    <l>da unser zweier bette was</l>
  </stollen>
  <stollen>
    <l>da mugt ir vinden schone beide</l>
    <l>gebrochen bluomen unde gras</l>
  </stollen>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>

N.B. element : non-terminal no longer 1:1.

The canzone minus NTs

<canzone>
  <l>unter den linden an der heide</l>
  <l>da unser zweier bette was</l>
  <l>da mugt ir vinden schone beide</l>
  <l>gebrochen bluomen unde gras</l>
  <l>kuste er mich? wol tusentstunt</l>
  <l>tandaradei</l>
  <l>seht wie rot mir ist der munt</l>
</canzone>

Removing all non-terminals

<!ENTITY % stollen   "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang  "l+" >
<!ELEMENT canzone   (%aufgesang;, %abgesang;) >
<!ELEMENT l         (#PCDATA) >

ERROR: this DTD is illegal; why?

Advantages of DTDs

clarity
simplicity
well developed techniques for maintainability, extensibility

Shortcomings of DTDs

Among the shortcomings often noted:

ad hoc syntax (require own parser)
namespace support very weak
selection of attribute datatypes eccentric
attribute datatypes weak (no user-defined patterns, enumerations only)
no datatypes for element content
no way to say explicitly how elements are related (specializations, generalizations, same-structure)
determinism rule (need review? see slides of 2 October)
logical (element) structure and physical (entity) structure independent; why are they mixed in the same metalanguage?

Introduction to XSD

What's XSD?

XML Schema Definition Language

aka “XSDL”, “XML Schema”, “W3C XML Schema”, “WXS” (also “that @#$@%#$**&!! language”)

XSD 1.0 first draft 1999, W3C Recommendation 2001.

XSD 1.1: W3C Recommendation 2012.

A product of struggle: data-heads vs doc-heads.

Key properties of XSD

DTD++, DTD--
instance syntax
supporting programming-language and database-oriented types
design problems

XSD and DTDs

Ways XSD resembles DTDs:

grammar-based
same element : production rule relation
determinism rule
attempts to replicate all DTD functionality (except entity declarations)*

* N.B. omission of entity declarations a technical / political issue, not a value judgement.

DTD--

DTD constructs XSD omits:

entity declarations
conditional inclusion of blocks

DTD++

Some ways XSD goes beyond DTDs:

rational support for namespaces
larger datatype system*
explicit relations among elements (substitutability)
explicit types for elements
explicit relations among types (derivation by restriction, extension)**
numeric occurrence indicators
assertions*
conditional type assignment**

* discussed later

** not discussed; you're on your own

The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.

At the outer level is a schema element:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <!--* element declarations go here *-->
</xsd:schema>

N.B. the schema does not identify a document-root element / start symbol.

Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Declaring elements

Note difference between element declaration (outer) and element reference (inner).
Implicit occurrence information: min = max = 1.

Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
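The DTD occurrence indicators map onto minOccurs and maxOccurs in a fixed way. The table below is a small sketch of that correspondence; the names OCCURRENCE and element_particle are invented for this example, and the generated particle puts the attributes on the element reference, which is an alternative to putting them on the enclosing sequence as in the declarations above.

```python
# DTD occurrence indicator -> (minOccurs, maxOccurs)
OCCURRENCE = {
    "": ("1", "1"),           # no indicator: exactly once (the XSD default)
    "?": ("0", "1"),          # optional
    "*": ("0", "unbounded"),  # zero or more (Kleene closure)
    "+": ("1", "unbounded"),  # one or more (positive closure)
}

def element_particle(name: str, indicator: str = "") -> str:
    """Render a reference to element `name` with explicit occurrence bounds."""
    min_occurs, max_occurs = OCCURRENCE[indicator]
    return (f'<xsd:element ref="{name}" '
            f'minOccurs="{min_occurs}" maxOccurs="{max_occurs}"/>')

line_particle = element_particle("l", "+")
print(line_particle)
```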

Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Or, more simply, using a built-in simple type:

 <xsd:element name="l" type="xsd:string"/>

Supporting programming-language and dbms paradigms

tag/type distinction
named and anonymous datatypes
simple datatypes

The tag/type distinction

Let's generalize from

 <xsd:element name="l" type="xsd:string"/>

Can we do that for every element type?

N.B. four kinds of type

element type (vs. element, element instance)
data type
- simple type (lexical form has no markup)
- complex type (has element children)

In the example, l may be thought of as an accessor.

Top-level named complex types

Named types can be used to capture commonalities:

 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines"/>
 <xsd:element name="stollen" type="lines"/>
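Since a schema document is itself XML, the commonality a named type captures can be read back out with ordinary XML tools. The sketch below groups top-level element declarations by their type attribute; the schema string is assembled from the declarations above plus the declaration of l, and the variable names are invented here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XSD_NS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:element name="abgesang" type="lines"/>
 <xsd:element name="stollen" type="lines"/>
 <xsd:element name="l" type="xsd:string"/>
</xsd:schema>
"""

schema = ET.fromstring(SCHEMA)

# Map each type name to the top-level elements declared with that type.
elements_by_type = defaultdict(list)
for decl in schema.findall(f"{{{XSD_NS}}}element"):
    elements_by_type[decl.get("type")].append(decl.get("name"))

for type_name, element_names in elements_by_type.items():
    print(type_name, "->", element_names)
```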

Top-level complex types

... or just to provide a name for a type:

 <xsd:complexType name="canzoneform">
  <xsd:sequence>
   <xsd:element ref="aufgesang"/>
   <xsd:element ref="abgesang"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="aufgesang">
  <xsd:sequence>
   <xsd:element ref="stollen"/>
   <xsd:element ref="stollen"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="canzone" type="canzoneform"/>
 <xsd:element name="aufgesang" type="aufgesang"/>

Anonymous types

We can hide things using anonymous local types:

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="aufgesang">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name="stollen" type="lines"/>
       <xsd:element name="stollen" type="lines"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
    <xsd:element name="abgesang" type="lines"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Note nested declarations and definitions.

Inheritance / Type derivation

It turns out to be hard to model stepwise refinement of types:

restriction (preserves subset semantics)
extension (preserves prefix semantics)

Inheritance in document systems

Document systems turn out to have a very different model of class systems and inheritance.

inheritance of attributes
inheritance of locations

Design problems and research questions

7.18.1. Schemas and namespaces

Some (unpleasant) facts of life:

Namespaces allow us to distinguish mine from not-mine.
Namespaces do not provide universal names.
The namespace : language relation is 1:n.
The language : grammar relation is 1:n.
Therefore, the namespace : schema relation is 1:n.

Live with it.

7.18.2. Schema layers

We distinguish:

schema documents (with single target namespace)
schemas (sets of abstract components)

Schema composition operations:

import
include
include with override / redefine

7.18.3. Modularization

XML Schema makes it possible to write modular document type definitions:

late collection of schema components
namespace-aware name matching, validation
white-box wildcards (lax / opportunistic)
black-box wildcards (skip)

7.18.4. Linking document and schema

namespace name
schemaLocation hint
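The schemaLocation hint is carried by the xsi:schemaLocation attribute, whose value is a whitespace-separated list of pairs: a namespace name followed by a location. A minimal sketch of reading it follows; the namespace name http://example.org/ns/poem and the file name poem.xsd are hypothetical, as is the function name.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical instance document: namespace name and schema file are made up.
DOC = """\
<poem xmlns="http://example.org/ns/poem"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://example.org/ns/poem poem.xsd"/>
"""

def schema_location_hints(root) -> dict[str, str]:
    """Return {namespace name: location hint} from xsi:schemaLocation."""
    value = root.get(f"{{{XSI_NS}}}schemaLocation", "")
    tokens = value.split()
    # Tokens alternate: namespace name, location, namespace name, location, ...
    return dict(zip(tokens[0::2], tokens[1::2]))

hints = schema_location_hints(ET.fromstring(DOC))
print(hints)
```

It is only a hint: a validator may use it, ignore it, or locate a schema for the namespace by other means.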

7.18.5. Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.

additions, no changes
type assignment information
validation-attempted information (strict, lax, skip)
validation-outcome information

7.18.6. Non-local effects

Consider the HTML input element:

legal only in p and similar elements
legal only within form elements

SGML DTDs have partial solutions:

inclusion exceptions
content models

7.18.7. Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:

 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>

One bit of context information = double the size of grammar.

Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).
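The doubling can be made concrete by generating the type names. In this sketch each context bit is a hypothetical flag such as in-form, and every element gets one type per combination of flags; the second flag, in-table, is invented purely to show the growth.

```python
from itertools import product

def contextual_type_names(elements, context_flags):
    """One type name per element per combination of context flags."""
    names = []
    for element in elements:
        for bits in product([False, True], repeat=len(context_flags)):
            suffix = "".join(f"-{flag}"
                             for flag, on in zip(context_flags, bits) if on)
            names.append(f"{element}{suffix}-type")
    return names

elements = ["div", "p", "ul", "li"]

one_bit = contextual_type_names(elements, ["in-form"])
# "in-table" is an invented second flag, used only to illustrate the growth.
two_bits = contextual_type_names(elements, ["in-form", "in-table"])

print(len(elements), len(one_bit), len(two_bits))
print(one_bit[:2])
```

With n elements and k bits of context the grammar needs n times 2 to the power k types, which is the sense in which verbosity is traded for context-sensitivity.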

7.18.8. Determinism

The determinism rule remains controversial:

LL(1) guarantees may help implementors
All regular languages have a deterministic FSA;
... but not necessarily a deterministic regular expression!
Implications for closure under union, intersection.
Implications for subsumption tests.
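Whether a given content model obeys the rule can be tested mechanically. The sketch below uses the usual position-automaton (Glushkov) construction: a model is deterministic if no set of possible next positions contains two positions with the same element name. The tuple encoding of content models is an assumption of this example, and numeric occurrence bounds are not handled.

```python
def analyse(expr, symbols, follow):
    """Return (nullable, first, last) of expr; fill symbols and follow sets."""
    kind = expr[0]
    if kind == "sym":
        pos = len(symbols)          # each occurrence gets its own position
        symbols.append(expr[1])
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind == "seq":
        nullable, first, last = True, set(), set()
        for part in expr[1:]:
            p_null, p_first, p_last = analyse(part, symbols, follow)
            for pos in last:        # what can end the prefix is followed by
                follow[pos] |= p_first  # what can begin this part
            if nullable:
                first |= p_first
            last = (last | p_last) if p_null else p_last
            nullable = nullable and p_null
        return nullable, first, last
    if kind == "alt":
        nullable, first, last = False, set(), set()
        for part in expr[1:]:
            p_null, p_first, p_last = analyse(part, symbols, follow)
            nullable = nullable or p_null
            first |= p_first
            last |= p_last
        return nullable, first, last
    if kind in ("opt", "star", "plus"):
        p_null, p_first, p_last = analyse(expr[1], symbols, follow)
        if kind != "opt":           # repetition: the end loops back to the start
            for pos in p_last:
                follow[pos] |= p_first
        return (p_null or kind != "plus"), p_first, p_last
    raise ValueError(f"unknown node kind: {kind!r}")

def is_deterministic(expr) -> bool:
    """True if no choice point offers two positions with the same name."""
    symbols, follow = [], {}
    _, first, _ = analyse(expr, symbols, follow)
    for candidates in [first, *follow.values()]:
        names = [symbols[pos] for pos in candidates]
        if len(names) != len(set(names)):
            return False
    return True

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
factored = ("seq", a, ("alt", b, c))                  # a, (b | c)
unfactored = ("alt", ("seq", a, b), ("seq", a, c))    # (a, b) | (a, c)

print(is_deterministic(factored), is_deterministic(unfactored))
```

The two sample models accept the same language, but only the factored form passes the check: on seeing an a, the unfactored form cannot tell which branch it is in without looking ahead.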

Constructs to mention

Other constructs we will discuss if there is time:

abstract elements
named model groups

Assignments

Due: Sunday morning 21 October 2012.

Use declarations in the internal subset of a DTD (the part of the DTD internal to the XML document instance) to modify the XML DTD of ISO 8879 Annex E by adding <emph> and <term> elements as phrase-level elements (to occur wherever <hp1> can occur).

Optionally, add further elements at phrase-level or other levels, and document them in comments.

The next two assignments relate to the vocabulary described below.

Write a DTD for the vocabulary described.
Write an XSD schema document for the vocabulary described.

If any constraints expressed in the prose are unenforceable in either DTD or XSD notation, write the schema either to overgenerate (i.e. to accept all good documents and fail to reject some bad ones), or else to undergenerate (i.e. to reject all bad ones, at the cost of failing to accept some good ones).

Assignments, background

We defined a vocabulary consisting of the following elements, which obey the constraints indicated.

<doc> is the outermost element; it contains a title, an optional copyright statement, a sequence of zero or more paragraph-level elements, and a sequence of zero or more sections (<div> elements), in that order.
<title> is the document title; it contains text (i.e. data characters) with phrase-level markup.
<copyright> is the copyright statement; it contains a sequence of one or more paragraphs.
<sec> is a section; it contains a title, a sequence of zero or more paragraph-level elements, and a sequence of zero or more sections (<div> elements), in that order.

(continued ...)

Assignments, background (cont'd)

Some elements are described here as being paragraph-level elements; that means they can occur at the same level as paragraphs, in sections and so on. (Sometimes we say “they occur between paragraphs”.)

<p> is a paragraph. It can always contain text and phrase-level markup. As a child of <doc> it can also contain notes and lists; as a child of <note> it can contain lists, but not notes.
<note> is a note. It contains a sequence of paragraph-level elements.
<list> is an itemized list. It contains a sequence of <item> elements.
<item> is a list item. You may choose whether it contains a sequence of paragraphs, or a sequence of paragraph-level elements. (N.B. <item> is not a paragraph-level element; it's listed here to be close to <list>.)

(continued ...)

Assignments, background (cont'd)

The phrase-level elements are:

<emph> (for rhetorical emphasis)
<term> (for technical terms)
<cited> (for titles of books and articles cited)
<ital> (for italics not otherwise accounted for)
<bold> (for bold not otherwise accounted for)

All phrase-level elements can contain character data and phrase-level elements.
