Document modeling

Vocabulary design and definition

Introduction to Relax NG

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

Rev. 23 October 2012


Overview

Organizational notes

Bureaucracy, paperwork, and so on ...

Scheduling future vocabulary tours

This week: DITA (Lehnen, Kapauan)
Future:
  • 23 October: DITA (Lehnen, Kapauan)
  • 30 October: EAD (A. Fenlon) (?)
  • 6 November: tbd (Crist) (?)

Scheduling minute-taking

This week: Crist
Future:
  • 23 October: Crist
  • 30 October: Bilansky
  • 6 November: K. Fenlon
  • 13 November: Kapauan
  • 20 November: Burke

Questions?

Anything we need to clear up before proceeding?

Questions!

The GSLIS curriculum committee wants to hear from you.

Tour of vocabularies, cont'd

This week: DITA

Where are we

Trying to keep some awareness of context.

You are here

Overview
Representation of individual documents
  • introduction to SGML and XML
    • syntax (angle brackets)
    • model (trees)
    • DTDs
  • historical survey
Schema languages
  • DTDs
  • → XSD
  • → Relax NG
  • Schematron
Semantics
Student projects
Wrap-up

Introduction to XSD (second try)

What's XSD?

XML Schema Definition Language
aka “XSDL”, “XML Schema”, “W3C XML Schema”, “WXS” (also “that @#$@%#$**&!! language”)
XSD 1.0 first draft 1999, W3C Recommendation 2001.
XSD 1.1: W3C Recommendation 2012.
A product of struggle: data-heads vs doc-heads.

Key properties of XSD

XSD and DTDs

Ways XSD resembles DTDs:
  • grammar-based
  • same element : production rule relation
  • determinism rule
  • attempts to replicate all DTD functionality (except entity declarations)*
* N.B. omission of entity declarations a technical / political issue, not a value judgement.

DTD--

DTD constructs XSD omits:
  • entity declarations
  • conditional inclusion of blocks

DTD++

Some ways XSD goes beyond DTDs:
  • rational support for namespaces
  • larger datatype system*
  • explicit relations among elements (substitutability)
  • explicit types for elements
  • explicit relations among types (derivation by restriction, extension)**
  • numeric occurrence indicators
  • assertions*
  • conditional type assignment**
* discussed later
** not discussed; you're on your own

The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a schema element:
<xsd:schema>
 <!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify a document-root element / start symbol.
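
For a validator to accept the skeleton above, the xsd prefix has to be bound to the XSD namespace. A minimal sketch of a complete schema document, assuming no target namespace and borrowing a single declaration for l as a stand-in for the real content:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal complete schema document; the single declaration is a placeholder. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <!-- Any top-level element declared here may serve as the document root. -->
 <xsd:element name="l" type="xsd:string"/>
</xsd:schema>
```

Because every top-level element declaration is a candidate root, the schema by itself does not single out a start symbol.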

Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Declaring elements

Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
or
 <xsd:element name="l" type="xsd:string"/>

Programming and dbms paradigms

The tag/type distinction

Let's generalize from
 <xsd:element name="l" type="xsd:string"/>
Can we do that for every element type?
N.B. four kinds of type
  • element type (vs. element, element instance)
  • data type
    • simple type (lexical form has no markup)
    • complex type (has element children and/or attributes)
In the example, l may be thought of as an accessor.

Top-level named complex types

Named types can be used to capture commonalities:
 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines"/>
 <xsd:element name="stollen" type="lines"/>

Top-level complex types

... or just to provide a name for a type:
 <xsd:complexType name="canzoneform">
  <xsd:sequence>
   <xsd:element ref="aufgesang"/>
   <xsd:element ref="abgesang"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="aufgesang">
  <xsd:sequence>
   <xsd:element ref="stollen"/>
   <xsd:element ref="stollen"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="canzone" type="canzoneform"/>
 <xsd:element name="aufgesang" type="aufgesang"/>

Anonymous types

We can hide things using anonymous local types:
 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="aufgesang">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name="stollen" type="lines"/>
       <xsd:element name="stollen" type="lines"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
    <xsd:element name="abgesang" type="lines"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
Note nested declarations and definitions.

Inheritance / type derivation

It turns out to be hard to model stepwise refinement of types:
  • restriction (preserves subset semantics)
  • extension (preserves prefix semantics)
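
A sketch of both derivation methods, built on the lines type from the earlier slide. The type names lines-with-refrain and tercet, and the refrain element, are invented for illustration only:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="l" type="xsd:string"/>

 <!-- Base type: one or more l elements. -->
 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <!-- Extension: the base content comes first (a prefix),
      then the added material. -->
 <xsd:complexType name="lines-with-refrain">
  <xsd:complexContent>
   <xsd:extension base="lines">
    <xsd:sequence>
     <xsd:element name="refrain" type="xsd:string"/>
    </xsd:sequence>
   </xsd:extension>
  </xsd:complexContent>
 </xsd:complexType>

 <!-- Restriction: exactly three lines; anything valid here
      is also valid against the base type (a subset). -->
 <xsd:complexType name="tercet">
  <xsd:complexContent>
   <xsd:restriction base="lines">
    <xsd:sequence minOccurs="3" maxOccurs="3">
     <xsd:element ref="l"/>
    </xsd:sequence>
   </xsd:restriction>
  </xsd:complexContent>
 </xsd:complexType>
</xsd:schema>
```

Note that a restriction has to restate the whole content model, while an extension states only what is added.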

Inheritance in document systems

Document systems turn out to have a very different model of class systems and inheritance.
  • inheritance of attributes
  • inheritance of locations

Schemas and namespaces

Some (unpleasant) facts of life:
  • Namespaces allow us to distinguish mine from not-mine.
  • Namespaces do not provide universal names.
  • The namespace : language relation is 1:n.
  • The language : grammar relation is 1:n.
  • Therefore, the namespace : schema relation is 1:n.
Live with it.

Schema layers

We distinguish:
  • schema documents (with single target namespace)
  • schemas (sets of abstract components)
Schema composition operations:
  • import
  • include
  • include with override / redefine
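
A sketch of the first two composition operations. The namespace names and the file names lines.xsd and meta.xsd are hypothetical, and the type canzoneform is assumed to be defined in the included document:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:p="http://example.org/ns/poem"
            targetNamespace="http://example.org/ns/poem">

 <!-- include: another schema document for the SAME target namespace -->
 <xsd:include schemaLocation="lines.xsd"/>

 <!-- import: components from a DIFFERENT namespace -->
 <xsd:import namespace="http://example.org/ns/meta"
             schemaLocation="meta.xsd"/>

 <!-- canzoneform is assumed to come from lines.xsd -->
 <xsd:element name="canzone" type="p:canzoneform"/>
</xsd:schema>
```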

Modularization

XML Schema makes it possible to write modular document type definitions:
  • late collection of schema components
  • namespace-aware name matching, validation
  • white-box wildcards (lax / opportunistic)
  • black-box wildcards (skip)
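
The two kinds of wildcard might be written as follows. This is a sketch only; the element names annotated-line and payload are invented for the example:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

 <!-- White box: foreign elements are validated if a declaration
      for them can be found, and accepted unvalidated otherwise. -->
 <xsd:element name="annotated-line">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="l" type="xsd:string"/>
    <xsd:any namespace="##other" processContents="lax"
             minOccurs="0" maxOccurs="unbounded"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <!-- Black box: contents need only be well-formed XML;
      the validator does not look inside. -->
 <xsd:element name="payload">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:any processContents="skip"
             minOccurs="0" maxOccurs="unbounded"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
</xsd:schema>
```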

Linking document and schema

  • namespace name
  • schemaLocation hint
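
In a document instance, the two mechanisms look roughly like this. The namespace name and the file name poem.xsd are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- xsi:schemaLocation holds pairs: namespace name, then a location hint.
     A processor is free to ignore the hint. -->
<canzone xmlns="http://example.org/ns/poem"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://example.org/ns/poem poem.xsd">
 <aufgesang>
  <stollen><l>...</l></stollen>
  <stollen><l>...</l></stollen>
 </aufgesang>
 <abgesang><l>...</l></abgesang>
</canzone>
```

For documents in no namespace, the corresponding attribute is xsi:noNamespaceSchemaLocation, which takes a single location hint.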

Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.
  • additions, no changes
  • type assignment information
  • validation-attempted information (strict, lax, skip)
  • validation-outcome information

Non-local effects

Consider the HTML input element:
  • legal only in p and similar elements
  • legal only within form elements
SGML DTDs have partial solutions:
  • inclusion exceptions
  • content models

Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:
 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).
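
In a real schema, the paired declarations above cannot both be top-level, since top-level element names must be unique. One way to get the effect is with local declarations, so that p is bound to one type outside a form and to another inside it. A reduced sketch, with simplified content models that are assumptions of this example:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="doc">
  <xsd:complexType>
   <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <!-- outside a form: p may not contain input -->
    <xsd:element name="p" type="p-type"/>
    <xsd:element name="form" type="form-type"/>
   </xsd:choice>
  </xsd:complexType>
 </xsd:element>

 <xsd:complexType name="form-type">
  <xsd:sequence>
   <!-- inside a form: same element name, different type -->
   <xsd:element name="p" type="p-in-form-type"
                minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

 <!-- text only -->
 <xsd:complexType name="p-type" mixed="true"/>

 <!-- text mixed with input elements -->
 <xsd:complexType name="p-in-form-type" mixed="true">
  <xsd:sequence>
   <xsd:element name="input" type="xsd:string"
                minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>
</xsd:schema>
```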

Determinism

The determinism rule remains controversial:
  • LL(1) guarantees may help implementors
  • All regular languages have a deterministic FSA;
  • ... but not necessarily a deterministic regular expression!
  • Implications for closure under union, intersection.
  • Implications for subsumption tests.

Constructs to mention

Other constructs we will discuss if there is time:
  • abstract elements
  • named model groups

Introduction to Relax NG

What's RNG?

Relax NG is a schema language
  • developed 2000-2001
  • explicitly as an alternative to XSD
  • by Murata Makoto, James Clark, and a working group at OASIS
  • on the basis of
    • Murata's earlier work on Relax
    • Clark's earlier work on TREX

Key properties of Relax NG

Schema languages are really about syntactic details, not about models

--John Cowan

Q. If schema languages aren't about models, why am I teaching you this?
A. I respectfully disagree.

A document grammar

As we did for DTDs and XSD, we'll write a document grammar for poems.
Limericks and canzone:
poem     ::= limerick | canzone

limerick ::= trimeter trimeter dimeter 
             dimeter trimeter
trimeter ::= CHAR+
dimeter  ::= CHAR+

canzone   ::= aufgesang abgesang
aufgesang ::= stollen stollen // aka piedi
stollen   ::= line+
abgesang  ::= line+ // aka cauda, sirima

The canzone schema v.1

As we did before, we'll begin by imitating the grammar slavishly (one rule, one element).
At the outer level is a grammar element:
<rng:grammar>
  <rng:start>
    <!--* description of starting pattern *-->
  </rng:start>
  
 <!--* element declarations go here *-->
</rng:grammar>

Describing the start symbol

Unlike XSD and DTDs, Relax NG grammars explicitly identify a start symbol (or more generally a starting pattern):
  <rng:start>
    <rng:element name="poem">
      ...
    </rng:element>  
  </rng:start>

Declaring elements

The <rng:element> element corresponds* to a BNF production rule (or can do):
    <rng:element name="poem">
      <rng:choice>
        <rng:element name="limerick">
          <!--* ... *-->
        </rng:element>
        
        <rng:element name="canzone">
          <!--* ... *-->
        </rng:element>
      </rng:choice>        
    </rng:element>
But note that <rng:element> elements can nest.

All-inline style

Taking the nesting to its logical conclusion, we get (see also poems.0.rng):
    <rng:element name="poem">
      <rng:choice>
        <rng:element name="limerick">
          <rng:element name="trimeter">
            <rng:text/>      
          </rng:element>
          <rng:element name="trimeter">
            <rng:text/>
          </rng:element>
          <rng:element name="dimeter">
            <rng:text/>      
          </rng:element>
          <rng:element name="dimeter">
            <rng:text/>      
          </rng:element>
          <rng:element name="trimeter">
            <rng:text/>      
          </rng:element>
        </rng:element>
        
        <rng:element name="canzone">
          <rng:element name="Aufgesang">
            <rng:element name="Stollen">
              <rng:oneOrMore>
                <rng:element name="line">
                  <rng:text/>
                </rng:element>      
              </rng:oneOrMore>
            </rng:element>
            <rng:element name="Stollen">
              <rng:oneOrMore>
                <rng:element name="line">
                  <rng:text/>
                </rng:element>   
              </rng:oneOrMore>
            </rng:element>
          </rng:element>
          <rng:element name="Abgesang">
            <rng:oneOrMore>
              <rng:element name="line">
                <rng:text/>
              </rng:element>  
            </rng:oneOrMore>
          </rng:element>
        </rng:element>
      </rng:choice>        
    </rng:element>

References

The <rng:element> element functions both as a production rule and as a literal in a RHS.
We can avoid the embedding by using a reference instead of <rng:element>:
    <rng:element name="poem">
      <rng:choice>
        <rng:ref name="limerick"/>
        <rng:ref name="canzone"/>
      </rng:choice>
    </rng:element>
The <rng:ref> element resembles:
  • (in BNF) RHS reference to a non-terminal
  • (in XSD) reference to a top-level element, attribute, or type
  • (in DTD) RHS reference to a parameter entity
Like parameter entities, <rng:ref> can expand to any* expression or sub-expression.

Pattern definitions

The <rng:ref> element refers to a pattern defined elsewhere. For example:
  <rng:define name="limerick">
    <rng:element name="limerick">
      <rng:ref name="trimeter"/>
      <rng:ref name="trimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="trimeter"/>
    </rng:element>
  </rng:define>
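
Putting start, ref, and define together gives a complete grammar for limericks alone. This is a sketch; only the namespace declaration and the definitions of trimeter and dimeter go beyond what the slides already show:

```xml
<rng:grammar xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <rng:start>
    <rng:ref name="limerick"/>
  </rng:start>

  <rng:define name="limerick">
    <rng:element name="limerick">
      <rng:ref name="trimeter"/>
      <rng:ref name="trimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="trimeter"/>
    </rng:element>
  </rng:define>

  <!-- one element, one pattern -->
  <rng:define name="trimeter">
    <rng:element name="trimeter">
      <rng:text/>
    </rng:element>
  </rng:define>

  <rng:define name="dimeter">
    <rng:element name="dimeter">
      <rng:text/>
    </rng:element>
  </rng:define>
</rng:grammar>
```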

Idioms for pattern definitions

Several well-known idioms for use of patterns:
  • inline: no definitions at all; everything is inline (aka ‘Russian doll’)
  • one element, one pattern: every element gets its own named pattern, containing just that one element (aka ‘salami slice’)
      <rng:define name="line">
        <rng:element name="line">
          <rng:text/>
        </rng:element>
      </rng:define>
  • one type*, one pattern: every element's content model gets its own named pattern (aka ‘Venetian blind’)
      <rng:define name="canzone-contents">
        <rng:element name="Aufgesang">
          <rng:ref name="Aufgesang-contents"/>
        </rng:element>
        <rng:element name="Abgesang">
          <rng:ref name="Abgesang-contents"/>
        </rng:element>
      </rng:define>
And of course combinations of the above.

Syntax: basic patterns

Basic patterns are:
  • <text>
  • <value> (with a literal value)
  • <data> (with ...)
  • <empty>
  • <element> (with name and content pattern)
  • <attribute> (with name and content pattern)

What basic patterns mean

The <text> pattern matches zero or more data characters (or white space).
The <value> pattern matches the enclosed literal.
The <data> pattern matches strings that are in the datatype specified by the type attribute.
The <empty> pattern matches the empty sequence (or white space).
An <element> pattern matches an element with an appropriate name, attributes, and contents.
An <attribute> pattern matches an attribute with an appropriate name and value.
The <value> and <data> patterns involve datatypes (to be discussed next week).
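
A sketch combining several basic patterns in one element. The attribute names n and rhyme are invented for the example, and the datatype library chosen is the XSD one:

```xml
<element name="line"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- data: any string in the lexical space of positiveInteger -->
  <attribute name="n">
    <data type="positiveInteger"/>
  </attribute>
  <!-- value: exactly this literal -->
  <attribute name="rhyme">
    <value>a</value>
  </attribute>
  <!-- text: zero or more data characters -->
  <text/>
</element>
```

Under these assumptions, a line carrying n="3" and rhyme="a" matches, while one carrying n="three" does not.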

Syntax: optionality and repetition

If P is a pattern, then so are:
  • <optional> P </optional>
  • <zeroOrMore> P </zeroOrMore>
  • <oneOrMore> P </oneOrMore>

Occurrence indicators

The <optional> pattern matches zero or one instance of things that match its content.
The <oneOrMore> pattern matches one or more instances of things that match its content.
The <zeroOrMore> pattern matches zero or more instances of things that match its content.
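
The three occurrence indicators side by side. The element names stanza, title, and note are placeholders for this sketch:

```xml
<element name="stanza" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- zero or one title -->
  <optional>
    <element name="title"><text/></element>
  </optional>
  <!-- at least one line -->
  <oneOrMore>
    <element name="line"><text/></element>
  </oneOrMore>
  <!-- any number of notes, including none -->
  <zeroOrMore>
    <element name="note"><text/></element>
  </zeroOrMore>
</element>
```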

Syntax: composition

If P and Q are patterns, then so are:
  • <choice> P Q</choice>
  • <group> P Q</group>
  • P Q … (= <rng:group>)
  • <interleave> P Q</interleave>
  • <mixed> P </mixed>
Sound familiar yet?

Compositors

A <choice> of P or Q matches anything that matches either P or Q.
A <group> of P and Q matches anything whose first part matches P and whose remainder matches Q. (Attributes complicate things slightly.)
The <interleave> of P and Q matches a sequence that can be created by shuffling one sequence matching P with another matching Q.
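The <except> of P removes things matching P from the pattern in which it is enclosed. (In Relax NG, <except> is allowed only inside <data> patterns and name classes.)
A <mixed> of P is shorthand for the <interleave> of <text> and P.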

Interleave example

String S interleaves strings S1 and S2 if:
  • The characters of S1 appear in S, in order.
  • The characters of S2 appear in S, in order.
  • When the characters of S1 are deleted from S, what's left is S2.
  • When the characters of S2 are deleted from S, what's left is S1.
Example: the interleave of a b c and x y matches
  • a b c x y
  • a b x c y
  • a b x y c
  • a x b c y
  • a x b y c
  • a x y b c
  • x a b c y
  • x a b y c
  • x a y b c
  • x y a b c
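
As a Relax NG pattern, the example might be written like this, treating a, b, c, x, y as empty elements. The wrapper element s is an assumption of the sketch:

```xml
<element name="s" xmlns="http://relaxng.org/ns/structure/1.0">
  <interleave>
    <!-- a b c, in that order -->
    <group>
      <element name="a"><empty/></element>
      <element name="b"><empty/></element>
      <element name="c"><empty/></element>
    </group>
    <!-- x y, in that order -->
    <group>
      <element name="x"><empty/></element>
      <element name="y"><empty/></element>
    </group>
  </interleave>
</element>
```

Each of the ten orderings listed above, wrapped in s, should match this pattern; an ordering such as b a c x y should not, because it breaks the order within the first group.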

Determinism, again

In DTDs and XSD, the determinism rule amounts to this:
  • <xsd:choice> P Q</xsd:choice> is a pattern only if P and Q are patterns, and the legal first characters (or: elements) of P and those of Q are disjoint.
  • <xsd:sequence> P Q</xsd:sequence> is a pattern only if P and Q are patterns, and either
    • P does not match the empty sequence, or
    • the legal first characters (or: elements) of P and those of Q are disjoint.

Determinism, in Relax NG

Relax NG has no determinism rule for choice and sequence.
But it does impose a determinism rule on interleave:
  • <rng:interleave> P Q</rng:interleave> is a pattern only if P and Q are patterns, and the characters (or: elements) in P and those in Q are disjoint.
Note: not just the initial characters, but all of them, must be disjoint. So (a, b, c) & (x, y, c) is not legal.

Not covered here

Relax NG has some constructs we omit here for brevity.
  • Rules for combining grammars from multiple sources.
  • Detailed rules for datatypes.

Compact syntax (basic patterns)

Relax NG has both an XML and a non-XML syntax.
  • <text> ⇒ text
  • <empty> ⇒ empty
  • <element> ⇒ element name { pattern }
  • <attribute> ⇒ attribute name { pattern }

Compact syntax (occurrences)

  • <optional> ⇒ ?
  • <zeroOrMore> ⇒ *
  • <oneOrMore> ⇒ +

Compact syntax (compositors)

  • <choice> ⇒ |
  • <group> ⇒ ,
  • <interleave> ⇒ &
  • <except> ⇒ -
  • <mixed> ⇒ mixed { pattern }

Compact syntax (defined patterns)

A pattern definition (<define>) ⇒ =.
A reference (<ref>) ⇒ name.

Attributes in content models

The most striking deviation of Relax NG from the model of DTDs and XSD (and context-free grammars in general):
Attributes are declared in the content model, not separately.
Consequences:
  • Easy to express some co-occurrence constraints, e.g.
    • Either attribute a or attribute b
    • If att1="latin" then contents are a b c, if att1="greek" then contents are alpha beta gamma, ...
  • Ad hoc semantics:
    attribute a, x, y, z
    has same meaning as
    x, y, attribute a, z
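
The second co-occurrence constraint above can be sketched in the XML syntax as a choice between two groups, each pairing an attribute value with the content it licenses. The wrapper element name alphabet is invented for the example:

```xml
<element name="alphabet" xmlns="http://relaxng.org/ns/structure/1.0">
  <choice>
    <!-- att1="latin" goes with a b c -->
    <group>
      <attribute name="att1"><value>latin</value></attribute>
      <element name="a"><empty/></element>
      <element name="b"><empty/></element>
      <element name="c"><empty/></element>
    </group>
    <!-- att1="greek" goes with alpha beta gamma -->
    <group>
      <attribute name="att1"><value>greek</value></attribute>
      <element name="alpha"><empty/></element>
      <element name="beta"><empty/></element>
      <element name="gamma"><empty/></element>
    </group>
  </choice>
</element>
```

Neither a DTD nor XSD 1.0 can tie the content model to an attribute value in this way.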

Enumerations (<value>)

To enumerate the possible values of a construct, use a choice of values:
  <define name="id-number-type">
    <choice>
      <value>isbn</value>
      <value>issn</value>
      <value>lccn</value>
      <value>CODEN</value>
    </choice>
  </define>
Or
id-number-type = ("isbn" | "issn" | "lccn" | "CODEN")

The poetry grammar

Sample RNG encodings of the poetry grammar illustrate different styles of Relax NG usage.
The same limitations apply as for the DTD and XSD versions:
  • The two Stollen must have same number of lines; this rule is not expressed.
  • The Abgesang must have more lines than a Stollen, fewer than Aufgesang; this rule is not expressed.
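
The grammars shown so far accept Stollen of unequal length, i.e. they accept some bad documents. The opposite trade-off is also possible: a grammar can enforce the equal-length rule by enumerating lengths, at the cost of rejecting good documents with longer Stollen. A sketch for the Aufgesang alone, where the pattern names Stollen2 and Stollen3 and the cut-off at three lines are arbitrary choices of this example:

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="Aufgesang">
      <choice>
        <!-- two Stollen of two lines each -->
        <group>
          <ref name="Stollen2"/>
          <ref name="Stollen2"/>
        </group>
        <!-- or two Stollen of three lines each -->
        <group>
          <ref name="Stollen3"/>
          <ref name="Stollen3"/>
        </group>
      </choice>
    </element>
  </start>

  <define name="Stollen2">
    <element name="Stollen">
      <ref name="line"/>
      <ref name="line"/>
    </element>
  </define>

  <define name="Stollen3">
    <element name="Stollen">
      <ref name="line"/>
      <ref name="line"/>
      <ref name="line"/>
    </element>
  </define>

  <define name="line">
    <element name="line"><text/></element>
  </define>
</grammar>
```

Since Relax NG has no determinism rule for choice, the two branches may both begin with a Stollen element.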

Non-local effects

Consider the HTML input element:
  • legal only in p and similar elements
  • legal only within form elements
SGML DTDs have partial solutions:
  • inclusion exceptions
  • content models
XSD and Relax NG have similar solutions:
  • local element / type binding (XSD)
  • local element / reference binding (RNG)

Non-local effects in Relax NG

Fundamentally, we trade verbosity for context-sensitivity. First we have a normal hierarchy:
start = element doc { doc-model }
doc-model = title, para-level*, (\div | form)*
title = element title { text }
para-level = p | note | \list
p = element p { (text | phrase | note | \list)* }
\list =
  element list {
    element item { p+ }+
  }
note = element note { p-in-note+ }
p-in-note = element p { mixed-phrases }
mixed-phrases = (text | phrase)*
phrase = emph | term | ital | bold
emph = element emph { mixed-phrases }
term = element term { mixed-phrases }
ital = element ital { mixed-phrases }
bold = element bold { mixed-phrases }
\div = element div { title, para-level*, (\div | form)* }
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).

Non-local effects (2)

Then we have a second hierarchy, within forms:
form =
  element form {
    attribute action { text },
    para-level-in-form*,
    (div-in-form)*
  }
para-level-in-form = p-in-form | note-in-form | list-in-form
p-in-form =
  element p {
    (text | input | phrase | note-in-form | list-in-form)*
  }
list-in-form =
  element list {
    element item { p-in-form+ }+
  }
note-in-form = element note { p-in-note-in-form+ }
p-in-note-in-form = element p { (text | phrase | input)* }
input =
  element input {
    attribute type { text },
    text
  }
div-in-form = element div { 
  title, para-level-in-form*, (div-in-form)* 
}
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).

Assignments

Assignments

Due: Sunday morning 28 October 2012.
The first assignment relates to the vocabulary described last week (description repeated below).
  1. Write a Relax NG schema for the vocabulary described.
If any constraints expressed in the prose are unenforceable in RNG notation, write the schema either to overgenerate (i.e. to accept all good documents and fail to reject some bad ones), or else to undergenerate (i.e. to reject all bad ones, at the cost of failing to accept some good ones).
The next assignment is our chance for a mid-course correction.
  1. Take stock.
    What modeling-related topics or questions do you most need or want practice and help with, over the next six or seven weeks?
    What assignments would you assign yourself, if you had someone to give you feedback on them?

Assignments, background

We defined a vocabulary consisting of the following elements, which obey the constraints indicated.
  • <doc> is the outermost element; it contains a title, an optional copyright statement, a sequence of zero or more paragraph-level elements, and a sequence of zero or more sections (<div> elements), in that order.
  • <title> is the document title; it contains text (i.e. data characters) with phrase-level markup.
  • <copyright> is the copyright statement; it contains a sequence of one or more paragraphs.
  • <div> is a section; it contains a title, a sequence of zero or more paragraph-level elements, and a sequence of zero or more sections (<div> elements), in that order.
(continued ...)

Assignments, background (cont'd)

Some elements are described here as being paragraph-level elements; that means they can occur at the same level as paragraphs, in sections and so on. (Sometimes we say “they occur between paragraphs”.)
  • <p> is a paragraph. It can always contain text and phrase-level markup. As a child of <doc> it can also contain notes and lists; as a child of <note> it can contain lists, but not notes.
  • <note> is a note. It contains a sequence of paragraph-level elements.
  • <list> is an itemized list. It contains a sequence of <item> elements.
  • <item> is a list item. You may choose whether it contains a sequence of paragraphs, or a sequence of paragraph-level elements. (N.B. <item> is not a paragraph-level element; it's listed here to be close to <list>.)
(continued ...)

Assignments, background (cont'd)

The phrase-level elements are:
  • <emph> (for rhetorical emphasis)
  • <term> (for technical terms)
  • <cited> (for titles of books and articles cited)
  • <ital> (for italics not otherwise accounted for)
  • <bold> (for bold not otherwise accounted for)
All phrase-level elements can contain character data and phrase-level elements.