Document modeling

Vocabulary design and definition

Introduction to XSD

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

Rev. 16 October 2012

Overview

Organizational notes

Bureaucracy, paperwork, and so on ...

Scheduling future vocabulary tours

This week: DocBook (K. Fenlon, Little)
Future:
  • 23 October: DITA (Lehnen, Kapauan)
  • 30 October: ...
  • 6 November: ...

Scheduling minute-taking

This week: Clark
Future:
  • 16 October: Clark
  • 23 October: Crist
  • 30 October: Bilansky
  • 6 November: K. Fenlon
  • 13 November: Kapauan
  • 20 November: Burke
Apologies; I will get the notes of earlier classes up very soon.

Questions?

Anything we need to clear up before proceeding?

Tour of vocabularies, cont'd

This week: DocBook

Where are we

Trying to keep some awareness of context.

You are here

Overview
Representation of individual documents
  • introduction to SGML and XML
    • syntax (angle brackets)
    • model (trees)
    • DTDs
  • historical survey
Schema languages
  • → DTDs
  • → XSD
  • Relax NG
  • Schematron
Semantics
Student projects
Wrap-up

Specimen problems of vocabulary specification

The general task

What does a vocabulary definition need to do?
A first answer:
  • define a set of documents*
    • what's in?
    • what's out?
    • how do you tell the difference?
  • describe how to interpret* them
    • how to process them?
    • how to learn from them?

Metalanguage design

What does a language for writing vocabulary definitions need to do?
A first answer:
  • make it easy to define a set of documents* — provide convenient tools for saying
    • what's in?
    • what's out?
    • provide a general mechanism to use for telling the difference
  • make it easy to describe how to interpret* them
    • provide a language for saying how to process them?
    • provide a language for saying how to learn from them?

Metalanguages for syntax

Current metalanguages for syntax mostly take three forms:
  • structure-based (record structures in PLs, DBMS)
  • grammar-based (BNF, EBNF, DTDs, XSD, Relax NG)
  • predicate-based (SQL CHECK clauses, Schematron, XSD assertions)
Plus combinations (of course).

Metalanguages for semantics

Current metalanguages for semantics take two forms:
  • translation into some other language* (denotational semantics, first-order logic, description logic, RDF, ...)
  • natural-language prose*
Plus combinations (of course).
What exactly does it mean to specify the semantics of a language? (Operational semantics? Declarative semantics?)

The syntax/semantics boundary

Many agree:
  • We know fairly well how to check syntax automatically. (formal grammars, BNF, parser generators, ...)
  • We don't know at all well how to check meaning automatically. (With exceptions for special cases.)
So many conclude:
  • What we know how to check automatically is syntax.
  • What we don't know how to check automatically is semantics.
Over time, the boundary visibly shifts.

Where we're spending our time

We'll touch upon:
  • grammar-based definitions of syntax
    • DTDs
    • XSD
    • Relax NG
  • predicate-based constraints (Schematron)
  • attempts to grapple with semantics
    • techniques for documentation
    • experimental and conjectural accounts

Additional requirements

In practice, our wants are more complex:
  • Multiplicity: public efforts define not one language but several* (TEI, HTML, DocBook, JATS).
  • Customization: public languages often designed for local adaptation (TEI, DocBook, JATS, DITA, ...).
  • Reuse: some wish to reuse others' work (semantics? syntax?).
  • Aggregation: some wish to combine multiple vocabularies. (Think: Excel embedded in Word, Word embedded in Excel and Access.)
Of these, the last has led to a concrete mechanism: XML namespaces.

XML Namespaces: two problems

(1) How do I distinguish my stuff from everybody else's stuff?
(2) How do we uniquely identify elements, attributes, and other things, defined by anyone at all, so they cannot be confused with things identified by other people?

XML Namespaces: one solution

(1) Put your stuff in a namespace different from everybody else's namespace.
(2) Use URIs. (Namespaces do not solve this one, sorry.*)

XML Namespaces: how they work

  1. Every name in an XML document is conceptually a pair:
    • a namespace name (syntactically a URI)*
    • a local name
  2. Every name in an XML document is syntactically a pair:
    • a prefix*
    • a local name
    separated by a colon*.
  3. Every prefix in an XML document maps to a namespace name.

XML Namespaces: example

In the XML, we have
<div>
  <h3>The document</h3>
  <p>The document we are constructing consists of a series of
  logical formulas:</p>
  <xf:group ref=".[count(*) = 0]">
    <p><em>The document is currently empty.</em></p>
  </xf:group>
  ...
</div>
In the XML, we also have
<html xmlns="http://www.w3.org/1999/xhtml" 
      xmlns:xf="http://www.w3.org/2002/xforms"
      ...
      >

XML Namespaces: unpacked

So
  • div in the XML expands to {http://www.w3.org/1999/xhtml}div
  • h3 in the XML expands to {http://www.w3.org/1999/xhtml}h3
  • p in the XML expands to {http://www.w3.org/1999/xhtml}p
  • xf:group in the XML expands to {http://www.w3.org/2002/xforms}group
  • em in the XML expands to {http://www.w3.org/1999/xhtml}em
So: the XForms and HTML working groups do not need to avoid each other's local names. Because they use different namespace names, their expanded names cannot conflict.
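
The expansion can be checked mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which reports every element name in the same {namespace-name}local-name form used above. The fragment is the one from the slide, wrapped and closed so that it parses on its own; that wrapping is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in the html start-tag that carries the
# namespace declarations. The closing tags are added for illustration.
DOC = """\
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xf="http://www.w3.org/2002/xforms">
  <div>
    <h3>The document</h3>
    <xf:group ref=".[count(*) = 0]">
      <p><em>The document is currently empty.</em></p>
    </xf:group>
  </div>
</html>
"""

root = ET.fromstring(DOC)

# ElementTree stores each element name as "{namespace-name}local-name".
expanded_names = [elem.tag for elem in root.iter()]

for name in expanded_names:
    print(name)
```

Unprefixed element names pick up the default namespace declared on html; only xf:group lands in the XForms namespace.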

How do we evaluate vocabularies?

We evaluate vocabulary definitions on
  • how well they fit our documents
  • how well we can process them
  • how easy they are to learn and use
(and probably more).

How do we evaluate metalanguages?

We evaluate metalanguages on
  • the quality of the vocabulary definitions we can write with them
  • the ease with which we can use them to write good vocabulary definitions
  • the solutions they provide for problems we know we'll face

Some problems

What problems do we know we face?
  • deviance from the norm
  • extension
  • restriction
  • interoperability
  • reuse and recombination
  • documentation
  • choice of schema language

N.B. Not all of these have metalanguage solutions!

Deviance from the norm

When 99% of the population follows simple rules, and a few outliers don't, how do we define the language?
E.g.
ab
ababababababab
abababab
...
abaabababbababa
Can we accept that last example as legal, while still capturing the regularity of the others?
(Q. what is that regularity? How do you define it?)
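
One way to make the tension concrete is to write the rule down twice, once strictly and once loosely. The sketch below is only one candidate reading of the regularity (one or more repetitions of "ab"); the two patterns are assumptions for illustration, not the answer to the question.

```python
import re

samples = ["ab", "ababababababab", "abababab", "abaabababbababa"]

# Candidate rule 1: captures the regularity, rejects the outlier.
STRICT = re.compile(r"(ab)+")
# Candidate rule 2: accepts the outlier, but says nothing about the pattern.
LOOSE = re.compile(r"[ab]+")

strict_result = {s: STRICT.fullmatch(s) is not None for s in samples}
loose_result = {s: LOOSE.fullmatch(s) is not None for s in samples}

for s in samples:
    print(f"{s:16} strict={strict_result[s]} loose={loose_result[s]}")
```

The strict rule documents what 99% of the population does and turns the outlier into an error; the loose rule admits everything and documents nothing.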

Extension

Suppose we want a vocabulary just like [pick your favorite vocabulary here], except that we also want to add some elements for talking about programming-language source code:
  • <code> for arbitrary bits of source code
  • <ident> for identifiers
  • <lit> for string literals
  • <kw> for keywords
  • <scrap> for larger scraps of source code to be knit together into the program
All are phrase-level except <scrap>, which can go either inside or between paragraphs.

Restriction

Suppose we want a vocabulary just like [pick-favorite-here], except that we don't want the element <timeStamp> to be legal.
Or we like all the elements, but we don't like the model (front?, body, back?): we want <front> and <titlePage> to be required.
All our documents should be valid against the base vocabulary, but not vice versa.

Interoperability

Suppose we allow users of our public vocabulary to customize it in various ways.
How can we ensure that their vocabularies are still interoperable* in some sense?
How can we define the sense and the degree to which they are, or are not, interoperable?

Reuse and recombination

Suppose we want to define our own vocabulary, but we want to save effort by reusing:
  • HTML's analysis of lists (<ul>, <ol>)
  • TEI's phrase-level elements (<emph>, <term>, ...)
  • DocBook's elements for program listings
  • SGML Open's table markup
  • ...
How do we define a language to make this easy to do?

Documentation

How do you document a vocabulary?
  • tutorial / introductory documentation
  • reference documentation
  • examples
  • edit-time hints?
  • processing requirements? expectations?

DTDs as a metalanguage

In which we attempt to take stock of DTDs as a schema language.

Maintainability of DTDs

Production DTDs use parameter entities widely for maintainability. (Examples from TEI P3 DTD.)
  • inclusion of external files for ‘modularity’ (tei2.dtd, teipros2.dtd, teidict2.dtd, ...)
  • definition of content models and fragments (%paraContent;, %phrase.seq;)
  • definition of element classes (%m.hqphrase;, %m.date;, %m.seg;, %m.bibl;, %m.phrase;, %m.lists;, ...)
  • definition of common attributes (%a.global;)
  • definition of pseudo-types (%ISO-date;)
Other uses of PEs:
  • attribute defaults
  • enumerated types (for attributes)

Extensibility of DTDs

Production DTDs use parameter entities widely to allow extension by users. (Examples from TEI P3 DTD.)
  • inclusion of external extension files (%TEI.extensions.ent;, %TEI.extensions.dtd;)
  • extension of element classes (%x.hqphrase;, %x.date;, %x.seg;, ...)
  • renaming of elements (%n.TEI2;, %n.teiHeader;, ...)
  • conditional inclusion of declarations (%TEI.prose;, %TEI.verse;, etc., all have values IGNORE or INCLUDE; conditional sections control inclusion)

DTDs and BNF

Any DTD can be translated into a context-free grammar / BNF.
<!ELEMENT x %x.model; >
x ::= start_x x.model end_x
start_x ::= "<x>" 
x.model ::= ... // translation of content model
end_x ::= "</x>" 
One element declaration, one production rule.
Attributes handled separately, not in grammar.
Consequence: validation is (conceptually) easy.
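
The translation scheme above is mechanical enough to sketch in a few lines of Python. The function names and the naive handling of the content model (commas become juxtaposition, #PCDATA becomes CHAR*) are assumptions of this sketch; attributes are ignored, as on the slide.

```python
def translate_model(model: str) -> str:
    """Naive rewrite of a DTD content model into BNF-like notation."""
    bnf = model.replace("#PCDATA", "CHAR*").replace(",", " ")
    return " ".join(bnf.split())  # collapse runs of whitespace


def element_to_bnf(name: str, model: str) -> list[str]:
    """Productions for one element declaration, following the slide's scheme."""
    return [
        f"{name} ::= start_{name} {name}.model end_{name}",
        f'start_{name} ::= "<{name}>"',
        f"{name}.model ::= {translate_model(model)}",
        f'end_{name} ::= "</{name}>"',
    ]


productions = element_to_bnf("canzone", "(aufgesang, abgesang)")
for rule in productions:
    print(rule)
```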

DTDs and BNF, qualifications

SGML/XML DTDs resemble Backus-Naur Form grammars, but:
  • They describe bracketed languages* ...
  • ... so ‘non-terminals’ are visible*.
  • SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete parsing problem for non-bracketed L).
  • They are not purely grammatical (notations, entities).
  • Determinism rule.
  • Elements and productions not necessarily 1 : 1.

A document grammar

Limericks and canzone:
poem     ::= limerick | canzone

limerick ::= trimeter trimeter dimeter 
             dimeter trimeter
trimeter ::= CHAR+
dimeter  ::= CHAR+

canzone   ::= aufgesang abgesang
aufgesang ::= stollen stollen // aka piedi
stollen   ::= line+
abgesang  ::= line+ // aka cauda, sirima

A DTD

Limericks and canzone:
<!ELEMENT poem (limerick | canzone) >

<!ELEMENT limerick (trimeter, trimeter, 
                    dimeter, dimeter, 
                    trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter  (#PCDATA)>

<!ELEMENT canzone   (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen   (l+) >
<!ELEMENT abgesang  (l+) >
<!ELEMENT l         (#PCDATA) >

A limerick

<poem>
  <limerick>
    <trimeter>
      There was a young lady named Bright
    </trimeter>
    <trimeter>
      whose speed was much faster than light.
    </trimeter>
    <dimeter>She set out one day,</dimeter>
    <dimeter>in a relative way,</dimeter>
    <trimeter>
      and returned on the previous night.
    </trimeter>
  </limerick>
</poem>

A canzone

<poem>
  <canzone>
    <aufgesang>
      <stollen>
        <l>unter den linden an der heide</l>
        <l>da unser zweier bette was</l>
      </stollen>
      <stollen>
        <l>da mugt ir vinden schone beide</l>
        <l>gebrochen bluomen unde gras</l>
      </stollen>
    </aufgesang>
    <abgesang>
      <l>kuste er mich? wol tusentstunt</l>
      <l>tandaradei</l>
      <l>seht wie rot mir ist der munt</l>
    </abgesang>
  </canzone>
</poem>
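
Because each content model is a regular expression over child element names, checking a document against the DTD can be sketched with ordinary regular expressions. In the sketch below the content models are transcribed by hand, and the encoding (each child name followed by one space) is an assumption; character data and attributes are not checked.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD, as regular expressions over
# space-terminated child names. "" means: no element children.
MODELS = {
    "poem": r"(limerick |canzone )",
    "limerick": r"trimeter trimeter dimeter dimeter trimeter ",
    "canzone": r"aufgesang abgesang ",
    "aufgesang": r"stollen stollen ",
    "stollen": r"(l )+",
    "abgesang": r"(l )+",
    "l": r"",
    "trimeter": r"",
    "dimeter": r"",
}


def invalid_elements(elem):
    """Names of elements (in document order) whose children break the model."""
    children = "".join(child.tag + " " for child in elem)
    model = MODELS.get(elem.tag)
    ok = model is not None and re.fullmatch(model, children) is not None
    bad = [] if ok else [elem.tag]
    for child in elem:
        bad.extend(invalid_elements(child))
    return bad


GOOD = ("<poem><canzone><aufgesang>"
        "<stollen><l>unter den linden an der heide</l></stollen>"
        "<stollen><l>da mugt ir vinden schone beide</l></stollen>"
        "</aufgesang><abgesang><l>tandaradei</l></abgesang></canzone></poem>")
# Same poem without the aufgesang wrapper.
FLAT = ("<poem><canzone>"
        "<stollen><l>unter den linden an der heide</l></stollen>"
        "<stollen><l>da mugt ir vinden schone beide</l></stollen>"
        "<abgesang><l>tandaradei</l></abgesang></canzone></poem>")

good_errors = invalid_elements(ET.fromstring(GOOD))
flat_errors = invalid_elements(ET.fromstring(FLAT))
print(good_errors, flat_errors)
```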

Note on the canzone DTD

Removing non-terminals

<!ENTITY % poem     "(limerick | canzone)" >
<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines     "l+" >
<!ELEMENT canzone   (%aufgesang;, abgesang) >
<!ELEMENT stollen   (%lines;) >
<!ELEMENT abgesang  (%lines;) >
<!ELEMENT l         (#PCDATA) >
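
Parameter entities are expanded textually before the declarations are interpreted, so the effective content models can be recovered by simple substitution. The sketch below mimics that substitution; the function name is hypothetical, and it assumes (as XML requires) that entities are not recursive.

```python
import re

# Replacement texts of the parameter entities declared above.
ENTITIES = {
    "poem": "(limerick | canzone)",
    "aufgesang": "stollen, stollen",
    "lines": "l+",
}

PE_REFERENCE = re.compile(r"%([\w.-]+);")


def expand(text: str, entities: dict) -> str:
    """Replace %name; references until none are left (no recursion assumed)."""
    while PE_REFERENCE.search(text):
        text = PE_REFERENCE.sub(lambda m: entities[m.group(1)], text)
    return text


canzone_model = expand("(%aufgesang;, abgesang)", ENTITIES)
stollen_model = expand("(%lines;)", ENTITIES)
print(canzone_model)
print(stollen_model)
```

After expansion the name aufgesang survives only in the DTD source: no element of that name is declared any longer.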

The canzone minus explicit Aufgesang

<canzone>
  <stollen>
    <l>unter den linden an der heide</l>
    <l>da unser zweier bette was</l>
  </stollen>
  <stollen>
    <l>da mugt ir vinden schone beide</l>
    <l>gebrochen bluomen unde gras</l>
  </stollen>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>

N.B. element : non-terminal no longer 1:1.

The canzone minus NTs

<canzone>
  <l>unter den linden an der heide</l>
  <l>da unser zweier bette was</l>
  <l>da mugt ir vinden schone beide</l>
  <l>gebrochen bluomen unde gras</l>
  <l>kuste er mich? wol tusentstunt</l>
  <l>tandaradei</l>
  <l>seht wie rot mir ist der munt</l>
</canzone>

Removing all non-terminals

<!ENTITY % stollen   "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang  "l+" >
<!ELEMENT canzone   (%aufgesang;, %abgesang;) >
<!ELEMENT l         (#PCDATA) >
ERROR: this DTD is illegal; why?

Advantages of DTDs

  • clarity
  • simplicity
  • well developed techniques for maintainability, extensibility

Shortcomings of DTDs

Among the shortcomings often noted:
  • ad hoc syntax (requires its own parser)
  • namespace support very weak
  • selection of attribute datatypes eccentric
  • attribute datatypes weak (no user-defined patterns, enumerations only)
  • no datatypes for element content
  • no way to say explicitly how elements are related (specializations, generalizations, same-structure)
  • determinism rule (need review? see slides of 2 October)
  • logical (element) structure and physical (entity) structure independent; why are they mixed in the same metalanguage?

Introduction to XSD

What's XSD?

XML Schema Definition Language
aka “XSDL”, “XML Schema”, “W3C XML Schema”, “WXS” (also “that @#$@%#$**&!! language”)
XSD 1.0 first draft 1999, W3C Recommendation 2001.
XSD 1.1 2012.
A product of struggle: data-heads vs doc-heads.

Key properties of XSD

XSD and DTDs

Ways XSD resembles DTDs:
  • grammar-based
  • same element : production rule relation
  • determinism rule
  • attempts to replicate all DTD functionality (except entity declarations)*
* N.B. omission of entity declarations a technical / political issue, not a value judgement.

DTD--

DTD constructs XSD omits:
  • entity declarations
  • conditional inclusion of blocks

DTD++

Some ways XSD goes beyond DTDs:
  • rational support for namespaces
  • larger datatype system*
  • explicit relations among elements (substitutability)
  • explicit types for elements
  • explicit relations among types (derivation by restriction, extension)**
  • numeric occurrence indicators
  • assertions*
  • conditional type assignment**
* discussed later
** not discussed; you're on your own

The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a schema element:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify a document-root element / start symbol.
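
Since a schema document is itself XML, the point about the missing start symbol can be illustrated by listing its top-level element declarations: each of them is a possible document root. The sketch below uses a cut-down schema document invented for illustration (only stollen and l) and Python's standard library; it inspects the schema document and does not perform XSD validation.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# A cut-down schema document, for illustration only.
SCHEMA_DOC = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="stollen">
    <xsd:complexType>
      <xsd:sequence maxOccurs="unbounded">
        <xsd:element ref="l"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="l" type="xsd:string"/>
</xsd:schema>
"""

schema = ET.fromstring(SCHEMA_DOC)

# findall with a plain tag name looks only at direct children of the root,
# i.e. at the top-level declarations; the nested ref="l" is not included.
top_level = [decl.get("name") for decl in schema.findall(f"{{{XSD_NS}}}element")]
print("possible document roots:", top_level)
```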

Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Declaring elements

Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
or
 <xsd:element name="l" type="xsd:string"/>

Supporting programming-language and dbms paradigms

The tag/type distinction

Let's generalize from
 <xsd:element name="l" type="xsd:string"/>
Can we do that for every element type?
N.B. four kinds of type
  • element type (vs. element, element instance)
  • data type
    • simple type (lexical form has no markup)
    • complex type (has element children)
In the example, l may be thought of as an accessor.

Top-level named complex types

Named types can be used to capture commonalities:
 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines"/>
 <xsd:element name="stollen" type="lines"/>

Top-level complex types

... or just to provide a name for a type:
 <xsd:complexType name="canzoneform">
  <xsd:sequence>
   <xsd:element ref="aufgesang"/>
   <xsd:element ref="abgesang"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="aufgesang">
  <xsd:sequence>
   <xsd:element ref="stollen"/>
   <xsd:element ref="stollen"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="canzone" type="canzoneform"/>
 <xsd:element name="aufgesang" type="aufgesang"/>

Anonymous types

We can hide things using anonymous local types:
 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="aufgesang">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name="stollen" type="lines"/>
       <xsd:element name="stollen" type="lines"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
    <xsd:element name="abgesang" type="lines"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
Note nested declarations and definitions.

Type derivation

It turns out to be hard to model stepwise refinement of types:
  • restriction (preserves subset semantics)
  • extension (preserves prefix semantics)
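
A rough way to see the two guarantees is to treat content models as regular expressions over child names and test them on small inputs. The three models below (a base type, a restriction requiring at least one p, an extension appending note elements) are invented for illustration, and exhaustive checking up to length four is a sanity check, not a proof.

```python
import re
from itertools import product

# Content models over space-terminated child names (hypothetical types).
BASE = r"title (p )*"
RESTRICTED = r"title (p )+"        # restriction: at least one p
EXTENDED = BASE + r"(note )*"      # extension: base content, then notes


def sequences(names, max_len):
    """All child-name sequences of length 0..max_len."""
    for n in range(max_len + 1):
        for combo in product(names, repeat=n):
            yield "".join(name + " " for name in combo)


def has_base_prefix(seq):
    """True if some prefix of seq, cut at a name boundary, matches BASE."""
    parts = seq.split()
    return any(
        re.fullmatch(BASE, "".join(p + " " for p in parts[:i]))
        for i in range(len(parts) + 1)
    )


samples = list(sequences(["title", "p", "note"], 4))

# Subset semantics: whatever the restriction accepts, the base accepts.
subset_ok = all(re.fullmatch(BASE, s) for s in samples
                if re.fullmatch(RESTRICTED, s))
# Prefix semantics: whatever the extension accepts starts with base content.
prefix_ok = all(has_base_prefix(s) for s in samples
                if re.fullmatch(EXTENDED, s))
print(subset_ok, prefix_ok)
```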

Inheritance in document systems

Document systems turn out to have a very different model of class systems and inheritance.
  • inheritance of attributes
  • inheritance of locations

Design problems and research questions

Schemas and namespaces

Some (unpleasant) facts of life:
  • Namespaces allow us to distinguish mine from not-mine.
  • Namespaces do not provide universal names.
  • The namespace : language relation is 1:n.
  • The language : grammar relation is 1:n.
  • Therefore, the namespace : schema relation is 1:n.
Live with it.

Schema layers

We distinguish:
  • schema documents (with single target namespace)
  • schemas (sets of abstract components)
Schema composition operations:
  • import
  • include
  • include with override / redefine

Modularization

XML Schema makes it possible to write modular document type definitions:
  • late collection of schema components
  • namespace-aware name matching, validation
  • white-box wildcards (lax / opportunistic)
  • black-box wildcards (skip)

Linking document and schema

  • namespace name
  • schemaLocation hint

Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.
  • additions, no changes
  • type assignment information
  • validation-attempted information (strict, lax, skip)
  • validation-outcome information

Non-local effects

Consider the HTML input element:
  • legal only in p and similar elements
  • legal only within form elements
SGML DTDs have partial solutions:
  • inclusion exceptions
  • content models

Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:
 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).
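
The doubling can be made explicit by generating the type names. In the sketch below, each boolean context flag (such as "inside a form") multiplies the number of variants per element by two; the function name and the second flag ("table") are hypothetical, and the naming pattern follows the declarations shown above.

```python
from itertools import product


def contextualized_types(element_names, context_flags):
    """One (element, type name) pair per combination of context flags."""
    decls = []
    for name in element_names:
        for values in product([False, True], repeat=len(context_flags)):
            suffix = "".join(
                f"-in-{flag}" for flag, on in zip(context_flags, values) if on
            )
            decls.append((name, f"{name}{suffix}-type"))
    return decls


elements = ["div", "p", "ul", "li"]
one_bit = contextualized_types(elements, ["form"])
two_bits = contextualized_types(elements, ["form", "table"])

for name, type_name in one_bit:
    print(f'<xsd:element name="{name}" type="{type_name}"/>')
print(len(elements), len(one_bit), len(two_bits))
```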

Determinism

The determinism rule remains controversial:
  • LL(1) guarantees may help implementors
  • All regular languages have a deterministic FSA;
  • ... but not necessarily a deterministic regular expression!
  • Implications for closure under union, intersection.
  • Implications for subsumption tests.

Constructs to mention

Other constructs we will discuss if there is time:
  • abstract elements
  • named model groups

Assignments

Due: Sunday morning 21 October 2012.
  1. Use declarations in the internal subset of a DTD (the part of the DTD internal to the XML document instance) to modify the XML DTD of ISO 8879 Annex E by adding <emph> and <term> elements as phrase-level elements (to occur wherever <hp1> can occur).
    Optionally, add further elements at phrase-level or other levels, and document them in comments.
The next two assignments relate to the vocabulary described below.
  2. Write a DTD for the vocabulary described.
  3. Write an XSD schema document for the vocabulary described.
If any constraints expressed in the prose are unenforceable in either DTD or XSD notation, write the schema either to overgenerate (i.e. to accept all good documents and fail to reject some bad ones), or else to undergenerate (i.e. to reject all bad ones, at the cost of failing to accept some good ones).

Assignments, background

We defined a vocabulary consisting of the following elements, which obey the constraints indicated.
  • <doc> is the outermost element; it contains a title, an optional copyright statement, a sequence of zero or more paragraph-level elements, and a sequence of zero or more sections (<div> elements), in that order.
  • <title> is the document title; it contains text (i.e. data characters) with phrase-level markup.
  • <copyright> is the copyright statement; it contains a sequence of one or more paragraphs.
  • <div> is a section; it contains a title, a sequence of zero or more paragraph-level elements, and a sequence of zero or more sections (<div> elements), in that order.
(continued ...)

Assignments, background (cont'd)

Some elements are described here as being paragraph-level elements; that means they can occur at the same level as paragraphs, in sections and so on. (Sometimes we say “they occur between paragraphs”.)
  • <p> is a paragraph. It can always contain text and phrase-level markup. As a child of <doc> it can also contain notes and lists; as a child of <note> it can contain lists, but not notes.
  • <note> is a note. It contains a sequence of paragraph-level elements.
  • <list> is an itemized list. It contains a sequence of <item> elements.
  • <item> is a list item. You may choose whether it contains a sequence of paragraphs, or a sequence of paragraph-level elements. (N.B. <item> is not a paragraph-level element; it's listed here to be close to <list>.)
(continued ...)

Assignments, background (cont'd)

The phrase-level elements are:
  • <emph> (for rhetorical emphasis)
  • <term> (for technical terms)
  • <cited> (for titles of books and articles cited)
  • <ital> (for italics not otherwise accounted for)
  • <bold> (for bold not otherwise accounted for)
All phrase-level elements can contain character data and phrase-level elements.