Document modeling

Vocabulary design and definition

Introduction to Relax NG

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

Rev. 23 October 2012

Overview

Organizational notes
Vocabulary tour
Introduction to XSD (cont'd)
Introduction to Relax NG
Assignments for 28 October

Organizational notes

Bureaucracy, paperwork, and so on ...

Scheduling future vocabulary tours

This week: DITA (Lehnen, Kapauan)

Future:

23 October: DITA (Lehnen, Kapauan)
30 October: EAD (A. Fenlon) (?)
6 November: tbd (Crist) (?)

Scheduling minute-taking

This week: Crist

Future:

23 October: Crist
30 October: Bilansky
6 November: K. Fenlon
13 November: Kapauan
20 November: Burke

Questions?

Anything we need to clear up before proceeding?

Questions!

The GSLIS curriculum committee wants to hear from you.

Tour of vocabularies, cont'd

This week: DITA

Carl Lehnen
Sandra Kapauan

Where are we

Trying to keep some awareness of context.

Introduction to XSD (second try)

What's XSD?

XML Schema Definition Language

aka “XSDL”, “XML Schema”, “W3C XML Schema”, “WXS” (also “that @#$@%#$**&!! language”)

XSD 1.0 first draft 1999, W3C Recommendation 2001.

XSD 1.1: W3C Recommendation 2012.

A product of struggle: data-heads vs doc-heads.

Key properties of XSD

DTD++, DTD--
instance syntax
supporting programming-language and database-oriented types
design problems

XSD and DTDs

Ways XSD resembles DTDs:

grammar-based
same element : production rule relation
determinism rule
attempts to replicate all DTD functionality (except entity declarations)*

* N.B. omission of entity declarations a technical / political issue, not a value judgement.

DTD--

DTD constructs XSD omits:

entity declarations
conditional inclusion of blocks

DTD++

Some ways XSD goes beyond DTDs:

rational support for namespaces
larger datatype system*
explicit relations among elements (substitutability)
explicit types for elements
explicit relations among types (derivation by restriction, extension)**
numeric occurrence indicators
assertions*
conditional type assignment**

* discussed later

** not discussed; you're on your own

The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.

At the outer level is a schema element:

<xsd:schema>
 <!--* element declarations go here *-->
</xsd:schema>

N.B. the schema does not identify a document-root element / start symbol.
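For reference, a minimal sketch of a complete schema document, including the namespace binding that the fragment above leaves out. It assumes no target namespace, and the single declaration inside is only a stand-in for the real ones.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the xsd prefix must be bound to the XML Schema
     namespace for any of the xsd:* elements to be recognized. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Every top-level element declaration is a possible document
       root; nothing here singles one of them out. -->
  <xsd:element name="l" type="xsd:string"/>
</xsd:schema>
```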

Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Declaring elements

Note difference between element declaration (outer) and element reference (inner).
Implicit occurrence information: min = max = 1.
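Spelled out, the implicit defaults look like this. This is a sketch: the stollen declaration is reduced to a string so that the fragment stands alone.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="aufgesang">
    <xsd:complexType>
      <xsd:sequence>
        <!-- each line means the same as <xsd:element ref="stollen"/> -->
        <xsd:element ref="stollen" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="stollen" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- simplified stand-in for the real stollen declaration -->
  <xsd:element name="stollen" type="xsd:string"/>
</xsd:schema>
```

With numeric occurrence indicators, the two references could also be written as one reference with minOccurs="2" and maxOccurs="2".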

Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

or, more simply, using a built-in simple type:

 <xsd:element name="l" type="xsd:string"/>

Programming and dbms paradigms

tag/type distinction
named and anonymous datatypes
simple datatypes

The tag/type distinction

Let's generalize from

 <xsd:element name="l" type="xsd:string"/>

Can we do that for every element type?

N.B. four kinds of type

element type (vs. element, element instance)
data type
- simple type (lexical form has no markup)
- complex type (has element children and/or attributes)

In the example, l may be thought of as an accessor.

Top-level named complex types

Named types can be used to capture commonalities:

 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines"/>
 <xsd:element name="stollen" type="lines"/>

Top-level complex types

... or just to provide a name for a type:

 <xsd:complexType name="canzoneform">
  <xsd:sequence>
   <xsd:element ref="aufgesang"/>
   <xsd:element ref="abgesang"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="aufgesang">
  <xsd:sequence>
   <xsd:element ref="stollen"/>
   <xsd:element ref="stollen"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="canzone" type="canzoneform"/>
 <xsd:element name="aufgesang" type="aufgesang"/>

Anonymous types

We can hide things using anonymous local types:

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="aufgesang">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name="stollen" type="lines"/>
       <xsd:element name="stollen" type="lines"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
    <xsd:element name="abgesang" type="lines"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Note nested declarations and definitions.

Type derivation (not inheritance)

It turns out to be hard to model stepwise refinement of types:

restriction (preserves subset semantics)
extension (preserves prefix semantics)
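A sketch of both kinds of derivation, starting from a type of one or more lines. The names envoi, lines-with-envoi, and couplet are invented for illustration and are not part of the course schemas.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="l" type="xsd:string"/>

  <!-- base type: one or more lines -->
  <xsd:complexType name="lines">
    <xsd:sequence>
      <xsd:element ref="l" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- extension: the base content comes first, the additions follow
       it (hence "prefix semantics") -->
  <xsd:complexType name="lines-with-envoi">
    <xsd:complexContent>
      <xsd:extension base="lines">
        <xsd:sequence>
          <xsd:element name="envoi" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <!-- restriction: exactly two lines; anything valid against couplet
       is also valid against lines (hence "subset semantics") -->
  <xsd:complexType name="couplet">
    <xsd:complexContent>
      <xsd:restriction base="lines">
        <xsd:sequence>
          <xsd:element ref="l" minOccurs="2" maxOccurs="2"/>
        </xsd:sequence>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>
```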

Inheritance in document systems

Document systems turn out to have a very different model of class systems and inheritance.

inheritance of attributes
inheritance of locations

Schemas and namespaces

Some (unpleasant) facts of life:

Namespaces allow us to distinguish mine from not-mine.
Namespaces do not provide universal names.
The namespace : language relation is 1:n.
The language : grammar relation is 1:n.
Therefore, the namespace : schema relation is 1:n.

Live with it.

Schema layers

We distinguish:

schema documents (with single target namespace)
schemas (sets of abstract components)

Schema composition operations:

import
include
include with override / redefine
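A sketch of the two basic composition operations. The namespace names and file names are hypothetical; the schemaLocation values are hints about where the other schema documents might be found.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/ns/poems"
            xmlns="http://example.org/ns/poems">
  <!-- include: another schema document for the SAME target namespace -->
  <xsd:include schemaLocation="limerick.xsd"/>

  <!-- import: components belonging to a DIFFERENT namespace -->
  <xsd:import namespace="http://example.org/ns/metadata"
              schemaLocation="metadata.xsd"/>

  <xsd:element name="l" type="xsd:string"/>
</xsd:schema>
```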

Modularization

XML Schema makes it possible to write modular document type definitions:

late collection of schema components
namespace-aware name matching, validation
white-box wildcards (lax / opportunistic)
black-box wildcards (skip)
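The two wildcard styles can be sketched as follows. The element names annotations and payload are invented for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- White box: children are validated when a declaration for them
       can be found, and accepted without validation otherwise. -->
  <xsd:element name="annotations">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:any namespace="##any" processContents="lax"
                 minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- Black box: children need only be well-formed XML. -->
  <xsd:element name="payload">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:any namespace="##any" processContents="skip"
                 minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```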

Linking document and schema

namespace name
schemaLocation hint
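A sketch of an instance that carries both kinds of link. It assumes a version of the canzone schema with the hypothetical target namespace http://example.org/ns/poems, stored in a file called poems.xsd; a processor is free to ignore the hint and find schema components some other way.

```xml
<canzone xmlns="http://example.org/ns/poems"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://example.org/ns/poems poems.xsd">
  <!-- schemaLocation holds pairs: namespace name, then document location -->
  <aufgesang>
    <stollen><l>first line</l><l>second line</l></stollen>
    <stollen><l>third line</l><l>fourth line</l></stollen>
  </aufgesang>
  <abgesang><l>fifth line</l><l>sixth line</l><l>seventh line</l></abgesang>
</canzone>
```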

Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.

additions, no changes
type assignment information
validation-attempted information (strict, lax, skip)
validation-outcome information

Non-local effects

Consider the HTML input element:

legal only in p and similar elements
legal only within form elements

SGML DTDs have partial solutions:

inclusion exceptions
content models

Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:

 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>

One bit of context information = double the size of grammar.

Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).
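In an actual schema document the paired declarations cannot both be top-level, since top-level element names must be unique; the binding of one name to two types is done with local declarations. A much reduced sketch, with invented content models:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="doc">
    <xsd:complexType>
      <xsd:choice minOccurs="0" maxOccurs="unbounded">
        <!-- outside a form, p has the ordinary type -->
        <xsd:element name="p" type="p-type"/>
        <xsd:element name="form">
          <xsd:complexType>
            <xsd:sequence>
              <!-- inside a form, the same name is bound to another type -->
              <xsd:element name="p" type="p-in-form-type"
                           maxOccurs="unbounded"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>

  <!-- text only -->
  <xsd:complexType name="p-type" mixed="true">
    <xsd:sequence/>
  </xsd:complexType>

  <!-- text with input elements -->
  <xsd:complexType name="p-in-form-type" mixed="true">
    <xsd:sequence minOccurs="0" maxOccurs="unbounded">
      <xsd:element name="input" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```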

Determinism

The determinism rule remains controversial:

LL(1) guarantees may help implementors
All regular languages have a deterministic FSA;
... but not necessarily a deterministic regular expression!
Implications for closure under union, intersection.
Implications for subsumption tests.

Constructs to mention

Other constructs we will discuss if there is time:

abstract elements
named model groups

Introduction to Relax NG

What's RNG?

Relax NG is a schema language

developed 2000-2001
explicitly as an alternative to XSD
by Murata Makoto, James Clark, and a working group at OASIS
on the basis of
- Murata's earlier work on Relax
- Clark's earlier work on TREX

Key properties of Relax NG

Schema languages are really about syntactic details, not about models

--John Cowan

validation only
- no default values
- no type assignment
- no application dispatch mechanism
- no OO mapping
emphasis on orthogonality and mathematical purity
no* determinism rule (well, hardly any)
alignment* of attributes and elements (to some extent)
two syntaxes: XML and non-XML (‘compact’)

Q. If schema languages aren't about models, why am I teaching you this?

A. I respectfully disagree.

A document grammar

As we did for DTDs and XSD, we'll write a document grammar for poems.

Limericks and canzone:

poem     ::= limerick | canzone

limerick ::= trimeter trimeter dimeter 
             dimeter trimeter
trimeter ::= CHAR+
dimeter  ::= CHAR+

canzone   ::= aufgesang abgesang
aufgesang ::= stollen stollen // aka piedi
stollen   ::= line+
abgesang  ::= line+ // aka cauda, sirima

The canzone schema v.1

As we did before, we'll begin by imitating the grammar slavishly (one rule, one element).

At the outer level is a grammar element:

<rng:grammar>
  <rng:start>
    <!--* description of starting pattern *-->
  </rng:start>
  
 <!--* element declarations go here *-->
</rng:grammar>

Describing the start symbol

Unlike XSD and DTDs, Relax NG grammars explicitly identify a start symbol (or more generally a starting pattern):

  <rng:start>
    <rng:element name="poem">
      ...
    </rng:element>  
  </rng:start>

Declaring elements

The <rng:element> element corresponds* to a BNF production rule (or can do):

    <rng:element name="poem">
      <rng:choice>
        <rng:element name="limerick">
          <!--* ... *-->
        </rng:element>
        
        <rng:element name="canzone">
          <!--* ... *-->
        </rng:element>
      </rng:choice>        
    </rng:element>

But note that <rng:element> elements can nest.

All-inline style

Taking the nesting to its logical conclusion, we get (see also poems.0.rng):

    <rng:element name="poem">
      <rng:choice>
        <rng:element name="limerick">
          <rng:element name="trimeter">
            <rng:text/>      
          </rng:element>
          <rng:element name="trimeter">
            <rng:text/>      
          </rng:element>
          <rng:element name="dimeter">
            <rng:text/>      
          </rng:element>
          <rng:element name="dimeter">
            <rng:text/>      
          </rng:element>
          <rng:element name="trimeter">
            <rng:text/>      
          </rng:element>
        </rng:element>
        
        <rng:element name="canzone">
          <rng:element name="Aufgesang">
            <rng:element name="Stollen">
              <rng:oneOrMore>
                <rng:element name="line">
                  <rng:text/>
                </rng:element>      
              </rng:oneOrMore>
            </rng:element>
            <rng:element name="Stollen">
              <rng:oneOrMore>
                <rng:element name="line">
                  <rng:text/>
                </rng:element>   
              </rng:oneOrMore>
            </rng:element>
          </rng:element>
          <rng:element name="Abgesang">
            <rng:oneOrMore>
              <rng:element name="line">
                <rng:text/>
              </rng:element>  
            </rng:oneOrMore>
          </rng:element>
        </rng:element>
      </rng:choice>        
    </rng:element>

References

The <rng:element> element functions both as a production rule and as a literal in a RHS.

We can avoid the embedding by using a reference instead of <rng:element>:

    <rng:element name="poem">
      <rng:choice>
        <rng:ref name="limerick"/>
        <rng:ref name="canzone"/>
      </rng:choice>
    </rng:element>

The <rng:ref> element resembles:

(in BNF) RHS reference to a non-terminal
(in XSD) reference to a top-level element, attribute, or type
(in DTD) RHS reference to a parameter entity

Like parameter entities, <rng:ref> can expand to any* expression or sub-expression.

Pattern definitions

The <rng:ref> element refers to a pattern defined elsewhere. For example:

  <rng:define name="limerick">
    <rng:element name="limerick">
      <rng:ref name="trimeter"/>
      <rng:ref name="trimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="trimeter"/>
    </rng:element>
  </rng:define>
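Put together with a start pattern and the remaining definitions, this gives a complete grammar. The sketch below covers limericks only and adds the namespace declaration that the fragments on these slides omit.

```xml
<rng:grammar xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <rng:start>
    <rng:ref name="limerick"/>
  </rng:start>

  <rng:define name="limerick">
    <rng:element name="limerick">
      <rng:ref name="trimeter"/>
      <rng:ref name="trimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="trimeter"/>
    </rng:element>
  </rng:define>

  <!-- every name used in a ref must have a matching define -->
  <rng:define name="trimeter">
    <rng:element name="trimeter">
      <rng:text/>
    </rng:element>
  </rng:define>

  <rng:define name="dimeter">
    <rng:element name="dimeter">
      <rng:text/>
    </rng:element>
  </rng:define>
</rng:grammar>
```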

Idioms for pattern definitions

Several well known idioms for use of patterns:

inline: no definitions at all; everything is inline (aka ‘Russian doll’)
one element, one pattern: every element gets its own named pattern, containing just that one element (aka ‘salami slice’)
```
  <rng:define name="line">
    <rng:element name="line">
      <rng:text/>
    </rng:element>
  </rng:define>
```

one type*, one pattern: every element's content model gets its own named pattern (aka ‘Venetian blind’)

  <rng:define name="canzone-contents">
    <rng:element name="Aufgesang">
      <rng:ref name="Aufgesang-contents"/>
    </rng:element>
    <rng:element name="Abgesang">
      <rng:ref name="Abgesang-contents"/>
    </rng:element>
  </rng:define>

And of course combinations of the above.

Syntax: basic patterns

Basic patterns are:

<text>
<value> (with a literal value)
<data> (with ...)
<empty>
<element> (with name and content pattern)
<attribute> (with name and content pattern)

What basic patterns mean

The <text> pattern matches zero or more data characters (or white space).

The <value> pattern matches the enclosed literal.

The <data> pattern matches strings that are in the datatype specified by the type attribute.

The <empty> pattern matches the empty sequence (or white space).

An <element> pattern matches an element with an appropriate name, attributes, and contents.

An <attribute> pattern matches an attribute with an appropriate name and value.

The <value> and <data> patterns involve datatypes (to be discussed next week).

Syntax: optionality and repetition

If P is a pattern, then so are:

<optional> P </optional>
<zeroOrMore> P </zeroOrMore>
<oneOrMore> P </oneOrMore>

Occurrence indicators

The <optional> pattern matches zero or one instance of things that match its content.

The <oneOrMore> pattern matches one or more instances of things that match its content.

The <zeroOrMore> pattern matches zero or more instances of things that match its content.

Syntax: composition

If P and Q are patterns, then so are:

<choice> P Q … </choice>
<group> P Q … </group>
P Q … (= <rng:group>)
<interleave> P Q … </interleave>
<mixed> P </mixed>

Sound familiar yet?

Compositors

A <choice> of P or Q matches anything that matches either P or Q.

A <group> of P and Q matches anything whose first part matches P and whose remainder matches Q. (Attributes complicate things slightly.)

The <interleave> of P and Q matches a sequence that can be created by shuffling one sequence matching P with another matching Q.

The <except> of P removes things matching P from the pattern in which it is enclosed.
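In Relax NG, <except> occurs inside <data> patterns and inside name classes. A sketch of the first use, with an invented element and attribute: the scheme attribute accepts any token other than the literal none.

```xml
<rng:element name="rhyme"
             xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <rng:attribute name="scheme">
    <!-- token comes from the built-in datatype library -->
    <rng:data type="token">
      <rng:except>
        <rng:value>none</rng:value>
      </rng:except>
    </rng:data>
  </rng:attribute>
  <rng:text/>
</rng:element>
```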

Interleave example

String S interleaves strings S1 and S2 if:

The characters of S1 appear in S, in order.
The characters of S2 appear in S, in order.
When the characters of S1 are deleted from S, what's left is S2.
When the characters of S2 are deleted from S, what's left is S1.

Example: the interleave of a b c and x y matches

a b c x y
a b x c y
a b x y c
a x b c y
a x b y c
a x y b c
x a b c y
x a b y c
x a y b c
x y a b c
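The same example as a Relax NG pattern, treating a, b, c, x, and y as empty elements inside a wrapper element s (the wrapper is an assumption made to get a complete schema). The ten orders listed above are the ten ways of choosing 2 of the 5 positions for x and y: 5! / (3! 2!) = 10.

```xml
<rng:element name="s"
             xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <rng:interleave>
    <!-- a b c, in that order -->
    <rng:group>
      <rng:element name="a"><rng:empty/></rng:element>
      <rng:element name="b"><rng:empty/></rng:element>
      <rng:element name="c"><rng:empty/></rng:element>
    </rng:group>
    <!-- x y, in that order -->
    <rng:group>
      <rng:element name="x"><rng:empty/></rng:element>
      <rng:element name="y"><rng:empty/></rng:element>
    </rng:group>
  </rng:interleave>
</rng:element>
```

This should accept, for example, an s element whose children are a, x, b, y, c, and reject one whose children are b, a, c, x, y.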

Determinism, again

In DTDs and XSD, the determinism rule amounts to this:

<xsd:choice> P Q … </xsd:choice> is a pattern only if P and Q are patterns, and the legal first characters (or: elements) of P and those of Q are disjoint.
<xsd:sequence> P Q … </xsd:sequence> is a pattern only if P and Q are patterns, and either
- P does not match the empty sequence, or
- the legal first characters (or: elements) of P and those of Q are disjoint.

Determinism, in Relax NG

Relax NG has no determinism rule for choice and sequence.

But it does impose a determinism rule on interleave:

<rng:interleave> P Q … </rng:interleave> is a pattern only if P and Q are patterns, and the characters (or: elements) in P and those in Q are disjoint.

Note: not just the initial characters, but all of them, must be disjoint. So (a, b, c) & (x, y, c) is not legal.

Not covered here

Relax NG has some constructs we omit here for brevity.

Rules for combining grammars from multiple sources.
Detailed rules for datatypes.

Compact syntax (basic patterns)

Relax NG has both an XML and a non-XML syntax.

<text> ⇒ text
<empty> ⇒ empty
<element> ⇒ element name { pattern }
<attribute> ⇒ attribute name { pattern }

Compact syntax (occurrences)

<optional> ⇒ ?
<zeroOrMore> ⇒ *
<oneOrMore> ⇒ +

Compact syntax (compositors)

<choice> ⇒ |
<group> ⇒ ,
<interleave> ⇒ &
<except> ⇒ -
<mixed> ⇒ mixed { pattern }

Compact syntax (defined patterns)

A pattern definition (<define>) ⇒ =.

A reference (<ref>) ⇒ name.

Attributes in content models

The most striking deviation of Relax NG from the model of DTDs and XSD (and context-free grammars in general):

Attributes are declared in the content model, not separately.

Consequences:

Easy to express some co-occurrence constraints, e.g.
- Either attribute a or attribute b
- If att1="latin" then contents are a b c, if att1="greek" then contents are alpha beta gamma, ...

Ad hoc semantics:

attribute a, x, y, z

has same meaning as

x, y, attribute a, z
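A sketch of the latin / greek constraint in the XML syntax. The wrapper element name letters is invented. The attribute pattern comes first in one branch and last in the other, to show that its position within the group makes no difference.

```xml
<rng:element name="letters"
             xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <rng:choice>
    <!-- att1="latin" goes with children a, b, c -->
    <rng:group>
      <rng:attribute name="att1">
        <rng:value>latin</rng:value>
      </rng:attribute>
      <rng:element name="a"><rng:text/></rng:element>
      <rng:element name="b"><rng:text/></rng:element>
      <rng:element name="c"><rng:text/></rng:element>
    </rng:group>
    <!-- att1="greek" goes with children alpha, beta, gamma -->
    <rng:group>
      <rng:element name="alpha"><rng:text/></rng:element>
      <rng:element name="beta"><rng:text/></rng:element>
      <rng:element name="gamma"><rng:text/></rng:element>
      <rng:attribute name="att1">
        <rng:value>greek</rng:value>
      </rng:attribute>
    </rng:group>
  </rng:choice>
</rng:element>
```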

Enumerations (<value>)

To enumerate the possible values of a construct, use a choice of values:

  <define name="id-number-type">
    <choice>
      <value>isbn</value>
      <value>issn</value>
      <value>lccn</value>
      <value>CODEN</value>
    </choice>
  </define>

id-number-type = ("isbn" | "issn" | "lccn" | "CODEN")

The poetry grammar

Sample RNG encodings of the poetry grammar illustrate different styles of Relax NG usage:

version 0 (XML, compact)
version 1 (XML, compact)
version 2 (XML, compact)
version 3 (XML, compact)

The same limitations apply as for the DTD and XSD versions:

The two Stollen must have same number of lines; this rule is not expressed.
The Abgesang must have more lines than a Stollen, fewer than Aufgesang; this rule is not expressed.

Non-local effects

Consider the HTML input element:

legal only in p and similar elements
legal only within form elements

SGML DTDs have partial solutions:

inclusion exceptions
content models

XSD and Relax NG have similar solutions:

local element / type binding (XSD)
local element / reference binding (RNG)

Non-local effects in Relax NG

Fundamentally, we trade verbosity for context-sensitivity. First we have a normal hierarchy:

start = element doc { doc-model }
doc-model = title, para-level*, (\div | form)*
title = element title { text }
para-level = p | note | \list
p = element p { (text | phrase | note | \list)* }
\list =
  element list {
    element item { p+ }+
  }
note = element note { p-in-note+ }
p-in-note = element p { mixed-phrases }
mixed-phrases = (text | phrase)*
phrase = emph | term | ital | bold
emph = element emph { mixed-phrases }
term = element term { mixed-phrases }
ital = element ital { mixed-phrases }
bold = element bold { mixed-phrases }
\div = element div { title, para-level*, (\div | form)* }

One bit of context information = double the size of grammar.

Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).

Non-local effects (2)

Then we have a second hierarchy, within forms:

form =
  element form {
    attribute action { text },
    para-level-in-form*,
    (div-in-form)*
  }
para-level-in-form = p-in-form | note-in-form | list-in-form
p-in-form =
  element p {
    (text | input | phrase | note-in-form | list-in-form)*
  }
list-in-form =
  element list {
    element item { p-in-form+ }+
  }
note-in-form = element note { p-in-note-in-form+ }
p-in-note-in-form = element p { (text | phrase | input)* }
input =
  element input {
    attribute type { text },
    text
  }
div-in-form = element div { 
  title, para-level-in-form*, (div-in-form)* 
}

One bit of context information = double the size of grammar.

Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).
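A small instance, made up for illustration, that the two hierarchies together should accept. The input element is allowed only because the enclosing p is itself inside a form.

```xml
<doc>
  <title>Sample document</title>
  <!-- ordinary hierarchy: no input element allowed in this p -->
  <p>Please fill in the <emph>whole</emph> form.</p>
  <form action="submit">
    <!-- in-form hierarchy: the same element name, a different pattern -->
    <p>Name: <input type="text">your name here</input></p>
  </form>
</doc>
```

Moving the input element into the first p should make the document invalid, since that p matches the pattern p, not p-in-form.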

Assignments

Due: Sunday morning 28 October 2012.

The first assignment relates to the vocabulary described last week (description repeated below).

Write a Relax NG schema for the vocabulary described.

If any constraints expressed in the prose are unenforceable in RNG notation, write the schema either to overgenerate (i.e. to accept all good documents and fail to reject some bad ones), or else to undergenerate (i.e. to reject all bad ones, at the cost of failing to accept some good ones).

The next assignment is our chance for a mid-course correction.

Take stock.

What modeling-related topics or questions do you most need or want practice and help with, over the next six or seven weeks?

What assignments would you assign yourself, if you had someone to give you feedback on them?

Assignments, background

We defined a vocabulary consisting of the following elements, which obey the constraints indicated.

<doc> is the outermost element; it contains a title, an optional copyright statement, a sequence of zero or more paragraph-level elements, and a sequence of zero or more sections (<div> elements), in that order.
<title> is the document title; it contains text (i.e. data characters) with phrase-level markup.
<copyright> is the copyright statement; it contains a sequence of one or more paragraphs.
<div> is a section; it contains a title, a sequence of zero or more paragraph-level elements, and a sequence of zero or more sections (nested <div> elements), in that order.

(continued ...)

Assignments, background (cont'd)

Some elements are described here as being paragraph-level elements; that means they can occur at the same level as paragraphs, in sections and so on. (Sometimes we say “they occur between paragraphs”.)

<p> is a paragraph. It can always contain text and phrase-level markup. As a child of <doc> it can also contain notes and lists; as a child of <note> it can contain lists, but not notes.
<note> is a note. It contains a sequence of paragraph-level elements.
<list> is an itemized list. It contains a sequence of <item> elements.
<item> is a list item. You may choose whether it contains a sequence of paragraphs, or a sequence of paragraph-level elements. (N.B. <item> is not a paragraph-level element; it's listed here to be close to <list>.)

(continued ...)

Assignments, background (cont'd)

The phrase-level elements are:

<emph> (for rhetorical emphasis)
<term> (for technical terms)
<cited> (for titles of books and articles cited)
<ital> (for italics not otherwise accounted for)
<bold> (for bold not otherwise accounted for)

All phrase-level elements can contain character data and phrase-level elements.
