Document modeling
Vocabulary design and definition
Introduction to Relax NG
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
Rev. 23 October 2012
Overview
- Organizational notes
- Vocabulary tour
- Introduction to XSD (cont'd)
- Introduction to Relax NG
- Assignments for 28 October
Organizational notes
Bureaucracy, paperwork, and so on ...
Scheduling future vocabulary tours
This week: DITA (Lehnen, Kapauan)
Future:
- 23 October: DITA (Lehnen, Kapauan)
- 30 October: EAD (A. Fenlon) (?)
- 6 November: tbd (Crist) (?)
Scheduling minute-taking
This week: Crist
Future:
- 23 October: Crist
- 30 October: Bilansky
- 6 November: K. Fenlon
- 13 November: Kapauan
- 20 November: Burke
Questions?
Anything we need to clear up before proceeding?
Tour of vocabularies, cont'd
This week: DITA
- Carl Lehnen
- Sandra Kapauan
Where are we
Trying to keep some awareness of context.
You are here
Overview
Representation of individual documents
- introduction to SGML and XML
- syntax (angle brackets)
- model (trees)
- DTDs
- historical survey
Schema languages
- DTDs
- → XSD
- → Relax NG
- Schematron
Semantics
Student projects
Wrap-up
Introduction to XSD (second try)
What's XSD?
XML Schema Definition Language
aka “XSDL”,
“XML Schema”,
“W3C XML Schema”,
“WXS”
(also “that @#$@%#$**&!! language”)
XSD 1.0 first draft 1999, W3C Recommendation 2001.
XSD 1.1 2012.
A product of struggle:
data-heads vs doc-heads.
Key properties of XSD
- DTD++, DTD--
- instance syntax
- supporting programming-language and database-oriented types
- design problems
XSD and DTDs
Ways XSD resembles DTDs:
- grammar-based
- same element : production rule relation
- determinism rule
- attempts to replicate all DTD functionality
(except entity declarations)*
* N.B. omission of entity declarations a
technical / political issue, not a value
judgement.
DTD--
DTD constructs XSD omits:
- entity declarations
- conditional inclusion of blocks
DTD++
Some ways XSD goes beyond DTDs:
- rational support for namespaces
- larger datatype system*
- explicit relations among elements
(substitutability)
- explicit types for elements
- explicit relations among types
(derivation by restriction, extension)**
- numeric occurrence indicators
- assertions*
- conditional type assignment**
* discussed later
** not discussed; you're on your own
The canzone schema v.1
In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a schema element:
<xsd:schema>
<!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify
a document-root element / start symbol.
Declaring elements
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Declaring elements
- Note difference between element declaration (outer)
and element reference (inner).
- Implicit occurrence information: min = max = 1.
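The implicit defaults can be made visible by writing them out. The sketch below first restates the aufgesang content model with explicit occurrence attributes, then shows an alternative that uses a numeric occurrence indicator; the second form is an illustrative rewriting assumed here, not part of the schema on the slides. As elsewhere in these slides, the xsd prefix is assumed to be bound to the XSD namespace.

```xml
<!-- The content model from the slide, with the default values written out. -->
<xsd:sequence minOccurs="1" maxOccurs="1">
  <xsd:element ref="stollen" minOccurs="1" maxOccurs="1"/>
  <xsd:element ref="stollen" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>

<!-- Alternative (assumed for illustration): one reference, required exactly twice. -->
<xsd:sequence>
  <xsd:element ref="stollen" minOccurs="2" maxOccurs="2"/>
</xsd:sequence>
```

Both content models accept exactly two stollen elements.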
Positive closure
<xsd:element name="abgesang">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="stollen">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Character data
<xsd:element name="l">
<xsd:complexType mixed="true">
<xsd:sequence>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
or
<xsd:element name="l" type="xsd:string"/>
Programming and dbms paradigms
- tag/type distinction
- named and anonymous datatypes
- simple datatypes
The tag/type distinction
Let's generalize from
<xsd:element name="l" type="xsd:string"/>
Can we do that for every element type?
N.B. four kinds of type:
- element type (vs. element, element instance)
- data type
- simple type (lexical form has no markup)
- complex type (has element children and/or attributes)
In the example, l may be thought of as an accessor.
Top-level named complex types
Named types can be used to capture commonalities:
<xsd:complexType name="lines">
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
<xsd:element name="abgesang" type="lines"/>
<xsd:element name="stollen" type="lines"/>
Top-level complex types
... or just to provide a name for a type:
<xsd:complexType name="canzoneform">
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="aufgesang">
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
<xsd:element name="canzone" type="canzoneform"/>
<xsd:element name="aufgesang" type="aufgesang"/>
Anonymous types
We can hide things using
anonymous local types:
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="stollen" type="lines"/>
<xsd:element name="stollen" type="lines"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="abgesang" type="lines"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Note nested declarations and definitions.
Inheritance Type derivation
It turns out to be hard to model stepwise refinement
of types:
- restriction (preserves subset semantics)
- extension (preserves prefix semantics)
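The two derivation methods can be sketched against the named types defined on the preceding slides (lines and canzoneform). The type names tercet and canzone-with-envoi and the envoi element are hypothetical, introduced only for illustration.

```xml
<!-- Restriction: every valid tercet is also a valid instance of lines (subset). -->
<xsd:complexType name="tercet">
  <xsd:complexContent>
    <xsd:restriction base="lines">
      <xsd:sequence minOccurs="3" maxOccurs="3">
        <xsd:element ref="l"/>
      </xsd:sequence>
    </xsd:restriction>
  </xsd:complexContent>
</xsd:complexType>

<!-- Extension: new content is appended after the content of canzoneform,
     so the base content model survives as a prefix. -->
<xsd:complexType name="canzone-with-envoi">
  <xsd:complexContent>
    <xsd:extension base="canzoneform">
      <xsd:sequence>
        <xsd:element name="envoi" type="lines" minOccurs="0"/>
      </xsd:sequence>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>
```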
Inheritance in document systems
Document systems turn out to have a very different model
of class systems and inheritance.
- inheritance of attributes
- inheritance of locations
Schemas and namespaces
Some (unpleasant) facts of life:
- Namespaces allow us to distinguish mine
from not-mine.
- Namespaces do not provide universal names.
- The namespace : language relation is 1:n.
- The language : grammar relation is 1:n.
- Therefore, the namespace : schema relation is 1:n.
Live with it.
Schema layers
We distinguish:
- schema documents (with single target namespace)
- schemas (sets of abstract components)
Schema composition operations:
- import
- include
- include with override / redefine
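A minimal sketch of the first two composition operations. The namespace names, the file names lines.xsd and metadata.xsd, and the md:header element are assumptions made for the example; aufgesang and abgesang are assumed to be declared in the included document.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://example.org/poems"
            xmlns:md="http://example.org/metadata"
            targetNamespace="http://example.org/poems"
            elementFormDefault="qualified">

  <!-- include: adds a schema document with the same target namespace (or none) -->
  <xsd:include schemaLocation="lines.xsd"/>

  <!-- import: makes components of a different namespace available for reference -->
  <xsd:import namespace="http://example.org/metadata"
              schemaLocation="metadata.xsd"/>

  <xsd:element name="canzone">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="md:header" minOccurs="0"/>
        <xsd:element ref="aufgesang"/>
        <xsd:element ref="abgesang"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```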
Modularization
XML Schema makes it possible to write modular document
type definitions:
- late collection of schema components
- namespace-aware name matching, validation
- white-box wildcards (lax / opportunistic)
- black-box wildcards (skip)
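The two kinds of wildcard might look as follows. The type names lines-with-extensions and opaque-payload are hypothetical; the l element is the one declared earlier.

```xml
<!-- White-box (lax): elements from other namespaces are validated when a
     declaration for them can be found, and accepted unchecked otherwise. -->
<xsd:complexType name="lines-with-extensions">
  <xsd:sequence>
    <xsd:element ref="l" maxOccurs="unbounded"/>
    <xsd:any namespace="##other" processContents="lax"
             minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>

<!-- Black-box (skip): the content need only be well-formed; it is not validated. -->
<xsd:complexType name="opaque-payload">
  <xsd:sequence>
    <xsd:any namespace="##any" processContents="skip"
             minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
```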
Linking document and schema
- namespace name
- schemaLocation hint
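In an instance document the two links might appear like this. The namespace name and the file name poems.xsd are assumptions for the example. The xsi:schemaLocation value is a list of pairs (namespace name, then document location) and is only a hint: a validator may ignore it.

```xml
<canzone xmlns="http://example.org/poems"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://example.org/poems poems.xsd">
  <aufgesang>
    <stollen><l>...</l></stollen>
    <stollen><l>...</l></stollen>
  </aufgesang>
  <abgesang><l>...</l></abgesang>
</canzone>
```

Documents whose elements are in no namespace use xsi:noNamespaceSchemaLocation instead, with a single location as its value.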
Post-schema-validation infoset (PSVI)
XML-Schema validation: infoset → infoset.
- additions, no changes
- type assignment information
- validation-attempted information (strict, lax, skip)
- validation-outcome information
Non-local effects
Consider the HTML input element:
- legal only in p and similar elements
- legal only within form elements
SGML DTDs have partial solutions:
- inclusion exceptions
- content models
Non-local effects in XML Schema
Fundamentally, we trade verbosity for context-sensitivity:
<xsd:element name="div" type="div-type"/>
<xsd:element name="div" type="div-in-form-type"/>
<xsd:element name="p" type="p-type"/>
<xsd:element name="p" type="p-in-form-type"/>
<xsd:element name="ul" type="ul-type"/>
<xsd:element name="ul" type="ul-in-form-type"/>
<xsd:element name="li" type="li-type"/>
<xsd:element name="li" type="li-in-form-type"/>
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of
context sensitivity).
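In an actual schema document two top-level element declarations cannot share a name, so the context-dependent bindings are made with local element declarations. A minimal sketch, reusing the type names from the list above; the content models are simplified assumptions.

```xml
<!-- Outside forms: p and div are bound to the ordinary types. -->
<xsd:complexType name="div-type">
  <xsd:sequence>
    <xsd:element name="p" type="p-type"
                 minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="div" type="div-type"
                 minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>

<!-- Inside forms: the same element names are bound to the in-form types. -->
<xsd:complexType name="div-in-form-type">
  <xsd:sequence>
    <xsd:element name="p" type="p-in-form-type"
                 minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="div" type="div-in-form-type"
                 minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
```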
Determinism
The determinism rule remains controversial:
- LL(1) guarantees may help implementors
- All regular languages have a deterministic FSA;
- ... but not necessarily a deterministic
regular expression!
- Implications for closure under union, intersection.
- Implications for subsumption tests.
Constructs to mention
Other constructs we will discuss if there is time:
- abstract elements
- named model groups
Introduction to Relax NG
What's RNG?
Relax NG is a schema language
- developed 2000-2001
- explicitly as an alternative to XSD
- by Murata Makoto, James Clark,
and a working group at OASIS
- on the basis of
- Murata's earlier work on Relax
- Clark's earlier work on TREX
Key properties of Relax NG
Schema languages are really about syntactic
details, not about models
--John Cowan
- validation only
- no default values
- no type assignment
- no application dispatch mechanism
- no OO mapping
- emphasis on orthogonality and
mathematical purity
- no* determinism rule (well, hardly any)
- alignment* of attributes and elements (to some extent)
- two syntaxes: XML and non-XML (‘compact’)
Q. If schema languages aren't about models,
why am I teaching you this?
A. I respectfully disagree.
A document grammar
As we did for DTDs and XSD, we'll write a document grammar
for poems.
Limericks and canzone:
poem ::= limerick | canzone
limerick ::= trimeter trimeter dimeter dimeter trimeter
trimeter ::= CHAR+
dimeter ::= CHAR+
canzone ::= aufgesang abgesang
aufgesang ::= stollen stollen // aka piedi
stollen ::= line+
abgesang ::= line+ // aka cauda, sirima
The canzone schema v.1
As we did before, we'll begin by imitating the grammar slavishly
(one rule, one element).
At the outer level is a grammar element:
<rng:grammar>
<rng:start>
<!--* description of starting pattern *-->
</rng:start>
<!--* element declarations go here *-->
</rng:grammar>
Describing the start symbol
Unlike XSD and DTDs, Relax NG grammars explicitly
identify a start symbol (or more generally a starting pattern):
<rng:start>
<rng:element name="poem">
...
</rng:element>
</rng:start>
Declaring elements
The <rng:element> element corresponds* to a BNF production rule
(or can do):
<rng:element name="poem">
<rng:choice>
<rng:element name="limerick">
<!--* ... *-->
</rng:element>
<rng:element name="canzone">
<!--* ... *-->
</rng:element>
</rng:choice>
</rng:element>
But note that <rng:element> elements
can nest.
All-inline style
Taking the nesting to its logical conclusion, we get
(see also poems.0.rng):
<rng:element name="poem">
<rng:choice>
<rng:element name="limerick">
<rng:element name="trimeter">
<rng:text/>
</rng:element>
<rng:element name="trimeter">
<rng:text/>
</rng:element>
<rng:element name="dimeter">
<rng:text/>
</rng:element>
<rng:element name="dimeter">
<rng:text/>
</rng:element>
<rng:element name="trimeter">
<rng:text/>
</rng:element>
</rng:element>
<rng:element name="canzone">
<rng:element name="Aufgesang">
<rng:element name="Stollen">
<rng:oneOrMore>
<rng:element name="line">
<rng:text/>
</rng:element>
</rng:oneOrMore>
</rng:element>
<rng:element name="Stollen">
<rng:oneOrMore>
<rng:element name="line">
<rng:text/>
</rng:element>
</rng:oneOrMore>
</rng:element>
</rng:element>
<rng:element name="Abgesang">
<rng:oneOrMore>
<rng:element name="line">
<rng:text/>
</rng:element>
</rng:oneOrMore>
</rng:element>
</rng:element>
</rng:choice>
</rng:element>
References
The <rng:element> element functions both as a production rule
and as a literal in a RHS.
We can avoid the embedding by using a reference instead of <rng:element>:
<rng:element name="poem">
<rng:choice>
<rng:ref name="limerick"/>
<rng:ref name="canzone"/>
</rng:choice>
</rng:element>
The <rng:ref> element resembles:
- (in BNF) RHS reference to a non-terminal
- (in XSD) reference to a top-level element, attribute, or type
- (in DTD) RHS reference to a parameter entity
Like parameter entities, <rng:ref> can expand to any*
expression or sub-expression.
Pattern definitions
The <rng:ref> element refers to a pattern
defined elsewhere. For example:
<rng:define name="limerick">
<rng:element name="limerick">
<rng:ref name="trimeter"/>
<rng:ref name="trimeter"/>
<rng:ref name="dimeter"/>
<rng:ref name="dimeter"/>
<rng:ref name="trimeter"/>
</rng:element>
</rng:define>
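Put together with a start pattern, the definition above gives a complete grammar. The sketch below is one way to finish it, assuming (as in the all-inline version) that trimeter and dimeter contain only text and, for brevity, that limerick is the start pattern.

```xml
<rng:grammar xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <rng:start>
    <rng:ref name="limerick"/>
  </rng:start>

  <rng:define name="limerick">
    <rng:element name="limerick">
      <rng:ref name="trimeter"/>
      <rng:ref name="trimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="dimeter"/>
      <rng:ref name="trimeter"/>
    </rng:element>
  </rng:define>

  <!-- One element, one pattern: each line type gets its own definition. -->
  <rng:define name="trimeter">
    <rng:element name="trimeter"><rng:text/></rng:element>
  </rng:define>

  <rng:define name="dimeter">
    <rng:element name="dimeter"><rng:text/></rng:element>
  </rng:define>
</rng:grammar>
```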
Idioms for pattern definitions
Several well known idioms for use of patterns:
- inline: no definitions at all
everything inline (aka ‘Russian doll’)
- one element, one pattern: every element
gets its own named pattern, containing just that
one element (aka ‘salami slice’)
<rng:define name="line">
<rng:element name="line">
<rng:text/>
</rng:element>
</rng:define>
- one type*, one pattern: every
element's content model gets its own named pattern
(aka ‘Venetian blind’)
<rng:define name="canzone-contents">
<rng:element name="Aufgesang">
<rng:ref name="Aufgesang-contents"/>
</rng:element>
<rng:element name="Abgesang">
<rng:ref name="Abgesang-contents"/>
</rng:element>
</rng:define>
And of course combinations of the above.
Syntax: basic patterns
Basic patterns are:
- <text>
- <value> (with a literal value)
- <data> (with ...)
- <empty>
- <element> (with name and content pattern)
- <attribute> (with name and content pattern)
What basic patterns mean
The <text> pattern matches zero or more data characters (or
white space).
The <value> pattern matches the enclosed literal.
The <data> pattern matches strings that are in
the datatype specified by the type attribute.
The <empty> pattern matches the empty sequence (or white space).
An <element> pattern matches an element with an
appropriate name, attributes, and contents.
An <attribute> pattern matches an attribute with an
appropriate name and value.
The <value> and <data> patterns involve
datatypes (to be discussed next week).
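A small sketch combining several basic patterns in one element. The attributes n and meter are hypothetical, and the datatype library named is the usual binding for XSD datatypes in Relax NG; the details of datatypes are left for next week.

```xml
<rng:element name="line"
             xmlns:rng="http://relaxng.org/ns/structure/1.0"
             datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- <attribute> with <data>: the value must be a positive integer -->
  <rng:attribute name="n">
    <rng:data type="positiveInteger"/>
  </rng:attribute>
  <!-- <attribute> with <value>: the value must be exactly this literal -->
  <rng:attribute name="meter">
    <rng:value>trimeter</rng:value>
  </rng:attribute>
  <!-- <text>: zero or more data characters -->
  <rng:text/>
</rng:element>
```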
Syntax: optionality and repetition
If P is a pattern, then so are:
- <optional> P </optional>
- <zeroOrMore> P </zeroOrMore>
- <oneOrMore> P </oneOrMore>
Occurrence indicators
The <optional> pattern matches zero or one instance
of things that match its content.
The <oneOrMore> pattern matches one or more instances
of things that match its content.
The <zeroOrMore> pattern matches zero or more instances
of things that match its content.
Syntax: composition
If P and Q are patterns, then so are:
- <choice> P Q … </choice>
- <group> P Q … </group>
- P Q … (= <rng:group>)
- <interleave> P Q … </interleave>
- <mixed> P </mixed>
Sound familiar yet?
Compositors
A <choice> of P or Q matches anything that matches
either P or Q.
A <group> of P and Q matches anything whose first part matches
P and whose remainder matches Q. (Attributes complicate things
slightly.)
The <interleave> of P and Q matches a sequence that can be
created by shuffling one sequence matching P with another matching
Q.
The <except> of P removes things matching P from
the construct in which it is enclosed; it is allowed only inside <data> and inside name classes.
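The <mixed> pattern is not described above; it is shorthand for interleaving <text/> with its content. A sketch showing the two equivalent spellings; the pattern name line-expanded and the emph element inside line are assumptions for the example.

```xml
<rng:grammar xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <rng:start>
    <rng:ref name="line"/>
  </rng:start>

  <!-- Text with any number of emph elements scattered through it. -->
  <rng:define name="line">
    <rng:element name="line">
      <rng:mixed>
        <rng:zeroOrMore>
          <rng:ref name="emph"/>
        </rng:zeroOrMore>
      </rng:mixed>
    </rng:element>
  </rng:define>

  <!-- The same content written out as an interleave with <text/>. -->
  <rng:define name="line-expanded">
    <rng:element name="line">
      <rng:interleave>
        <rng:text/>
        <rng:zeroOrMore>
          <rng:ref name="emph"/>
        </rng:zeroOrMore>
      </rng:interleave>
    </rng:element>
  </rng:define>

  <rng:define name="emph">
    <rng:element name="emph"><rng:text/></rng:element>
  </rng:define>
</rng:grammar>
```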
Interleave example
String S interleaves strings S1 and S2 if:
- The characters of S1 appear in S, in order.
- The characters of S2 appear in S, in order.
- When the characters of S1 are deleted from S, what's left is S2.
- When the characters of S2 are deleted from S, what's left is S1.
Example: the interleave of a b c and x y matches
- a b c x y
- a b x c y
- a b x y c
- a x b c y
- a x b y c
- a x y b c
- x a b c y
- x a b y c
- x a y b c
- x y a b c
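The same example can be written as a Relax NG pattern by treating each letter as an empty element. The wrapper element s and the use of empty elements are assumptions made for the sketch.

```xml
<rng:element name="s" xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <rng:interleave>
    <!-- first sequence: a b c, in that order -->
    <rng:group>
      <rng:element name="a"><rng:empty/></rng:element>
      <rng:element name="b"><rng:empty/></rng:element>
      <rng:element name="c"><rng:empty/></rng:element>
    </rng:group>
    <!-- second sequence: x y, in that order -->
    <rng:group>
      <rng:element name="x"><rng:empty/></rng:element>
      <rng:element name="y"><rng:empty/></rng:element>
    </rng:group>
  </rng:interleave>
</rng:element>
```

This accepts <s><a/><x/><b/><y/><c/></s> (the shuffle a x b y c) and rejects <s><b/><a/><c/><x/><y/></s>, where a and b are out of order.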
Determinism, again
In DTDs and XSD, the determinism rule amounts to this:
- <xsd:choice> P Q … </xsd:choice> is a pattern only if
P and Q are patterns, and the legal first characters (or: elements)
of P and those of Q are disjoint.
- <xsd:sequence> P Q … </xsd:sequence> is a pattern only if
P and Q are patterns, and either
- P does not match the empty sequence, or
- the legal first characters (or: elements)
of P and those of Q are disjoint.
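A minimal sketch of the choice rule, using hypothetical elements a, b, and c that are not part of the canzone schema. The first content model breaks the rule because both branches can begin with a; the second accepts the same sequences and satisfies it.

```xml
<!-- Not deterministic: on seeing a, a validator cannot tell which branch it is in. -->
<xsd:choice>
  <xsd:sequence>
    <xsd:element ref="a"/>
    <xsd:element ref="b"/>
  </xsd:sequence>
  <xsd:sequence>
    <xsd:element ref="a"/>
    <xsd:element ref="c"/>
  </xsd:sequence>
</xsd:choice>

<!-- Deterministic rewriting: the common first element is factored out. -->
<xsd:sequence>
  <xsd:element ref="a"/>
  <xsd:choice>
    <xsd:element ref="b"/>
    <xsd:element ref="c"/>
  </xsd:choice>
</xsd:sequence>
```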
Determinism, in Relax NG
Relax NG has no determinism rule for choice and sequence.
But it does impose a determinism rule on interleave:
- <rng:interleave> P Q … </rng:interleave> is a pattern only if
P and Q are patterns, and the characters (or:
elements) in P and those in Q are disjoint.
Note: not just the initial characters, but all of them,
must be disjoint. So (a, b, c) & (x, y, c) is
not legal.
Not covered here
Relax NG has some constructs we omit here for brevity.
- Rules for combining grammars from multiple sources.
- Detailed rules for datatypes.
Compact syntax (basic patterns)
Relax NG has both an XML and a non-XML syntax.
- <text> ⇒ text
- <empty> ⇒ empty
- <element> ⇒ element name { pattern }
- <attribute> ⇒ attribute name { pattern }
Compact syntax (occurrences)
- <optional> ⇒ ?
- <zeroOrMore> ⇒ *
- <oneOrMore> ⇒ +
Compact syntax (compositors)
- <choice> ⇒ |
- <group> ⇒ ,
- <interleave> ⇒ &
- <except> ⇒ -
- <mixed> ⇒ mixed { pattern }
Compact syntax (defined patterns)
A pattern definition (<define>) ⇒ =.
A reference (<ref>) ⇒ name.
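To show how the two syntaxes line up, here is a sketch of the canzone grammar in the XML syntax, with the compact form of each definition given in the comment above it. The pattern names are assumptions; the element names follow the all-inline example.

```xml
<rng:grammar xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <!-- compact: start = canzone -->
  <rng:start>
    <rng:ref name="canzone"/>
  </rng:start>

  <!-- compact: canzone = element canzone { aufgesang, abgesang } -->
  <rng:define name="canzone">
    <rng:element name="canzone">
      <rng:ref name="aufgesang"/>
      <rng:ref name="abgesang"/>
    </rng:element>
  </rng:define>

  <!-- compact: aufgesang = element Aufgesang { stollen, stollen } -->
  <rng:define name="aufgesang">
    <rng:element name="Aufgesang">
      <rng:ref name="stollen"/>
      <rng:ref name="stollen"/>
    </rng:element>
  </rng:define>

  <!-- compact: stollen = element Stollen { line+ } -->
  <rng:define name="stollen">
    <rng:element name="Stollen">
      <rng:oneOrMore><rng:ref name="line"/></rng:oneOrMore>
    </rng:element>
  </rng:define>

  <!-- compact: abgesang = element Abgesang { line+ } -->
  <rng:define name="abgesang">
    <rng:element name="Abgesang">
      <rng:oneOrMore><rng:ref name="line"/></rng:oneOrMore>
    </rng:element>
  </rng:define>

  <!-- compact: line = element line { text } -->
  <rng:define name="line">
    <rng:element name="line"><rng:text/></rng:element>
  </rng:define>
</rng:grammar>
```

The implicit group formed by adjacent children in the XML syntax corresponds to the comma in the compact syntax.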
Attributes in content models
The most striking deviation of Relax NG from the
model of DTDs and XSD (and context-free grammars in
general):
Attributes are declared in the content model,
not separately.
Consequences:
Easy to express some co-occurrence constraints, e.g.
- Either attribute a or attribute b
- If att1="latin" then contents are a b c,
if att1="greek" then contents are alpha beta gamma, ...
- Ad hoc semantics: attribute a, x, y, z has same meaning as x, y, attribute a, z
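Both co-occurrence constraints can be sketched in a single element. The element name sample is hypothetical, and the contents a b c and alpha beta gamma are assumed here to be empty elements.

```xml
<rng:element name="sample" xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <!-- Either attribute a or attribute b, but not both and not neither. -->
  <rng:choice>
    <rng:attribute name="a"><rng:text/></rng:attribute>
    <rng:attribute name="b"><rng:text/></rng:attribute>
  </rng:choice>
  <!-- The value of att1 determines which children are allowed. -->
  <rng:choice>
    <rng:group>
      <rng:attribute name="att1"><rng:value>latin</rng:value></rng:attribute>
      <rng:element name="a"><rng:empty/></rng:element>
      <rng:element name="b"><rng:empty/></rng:element>
      <rng:element name="c"><rng:empty/></rng:element>
    </rng:group>
    <rng:group>
      <rng:attribute name="att1"><rng:value>greek</rng:value></rng:attribute>
      <rng:element name="alpha"><rng:empty/></rng:element>
      <rng:element name="beta"><rng:empty/></rng:element>
      <rng:element name="gamma"><rng:empty/></rng:element>
    </rng:group>
  </rng:choice>
</rng:element>
```

Because attributes are unordered, it makes no difference where within a group the attribute patterns are written.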
Enumerations (<value>)
To enumerate the possible values of a construct,
use a choice of values:
<define name="id-number-type">
<choice>
<value>isbn</value>
<value>issn</value>
<value>lccn</value>
<value>CODEN</value>
</choice>
</define>
Or
id-number-type = ("isbn" | "issn" | "lccn" | "CODEN")
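A sketch of how such an enumeration might be used as the value of an attribute. The identifier element and its type attribute are hypothetical; the definition is repeated so that the grammar is complete.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="identifier">
      <!-- the attribute value must be one of the enumerated literals -->
      <attribute name="type">
        <ref name="id-number-type"/>
      </attribute>
      <text/>
    </element>
  </start>

  <define name="id-number-type">
    <choice>
      <value>isbn</value>
      <value>issn</value>
      <value>lccn</value>
      <value>CODEN</value>
    </choice>
  </define>
</grammar>
```

With this grammar, <identifier type="issn">...</identifier> is valid; type="doi" is not, and neither is type="ISBN", since the comparison is case-sensitive.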
The poetry grammar
Sample RNG encodings of the poetry grammar
illustrate different styles of Relax NG usage.
The same limitations apply as for the DTD and XSD
versions:
- The two Stollen must have same number of
lines; this rule is not expressed.
- The Abgesang must have more lines than a
Stollen, fewer than Aufgesang; this rule is not
expressed.
Non-local effects
Consider the HTML input element:
- legal only in p and similar elements
- legal only within form elements
SGML DTDs have partial solutions:
- inclusion exceptions
- content models
XSD and Relax NG have similar solutions:
- local element / type binding (XSD)
- local element / reference binding (RNG)
Non-local effects in Relax NG
Fundamentally, we trade verbosity for context-sensitivity.
First we have a normal hierarchy:
start = element doc { doc-model }
doc-model = title, para-level*, (\div | form)*
title = element title { text }
para-level = p | note | \list
p = element p { (text | phrase | note | \list)* }
\list =
element list {
element item { p+ }+
}
note = element note { p-in-note+ }
p-in-note = element p { mixed-phrases }
mixed-phrases = (text | phrase)*
phrase = emph | term | ital | bold
emph = element emph { mixed-phrases }
term = element term { mixed-phrases }
ital = element ital { mixed-phrases }
bold = element bold { mixed-phrases }
\div = element div { title, para-level*, (\div | form)* }
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of
context sensitivity).
Non-local effects (2)
Then we have a second hierarchy, within forms:
form =
element form {
attribute action { text },
para-level-in-form*,
(div-in-form)*
}
para-level-in-form = p-in-form | note-in-form | list-in-form
p-in-form =
element p {
(text | input | phrase | note-in-form | list-in-form)*
}
list-in-form =
element list {
element item { p-in-form+ }+
}
note-in-form = element note { p-in-note-in-form+ }
p-in-note-in-form = element p { (text | phrase | input)* }
input =
element input {
attribute type { text },
text
}
div-in-form = element div {
title, para-level-in-form*, (div-in-form)*
}
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of
context sensitivity).
Assignments
Due: Sunday morning 28 October 2012.
The first assignment relates to the vocabulary described
last week (description repeated below).
Write a Relax NG schema for the vocabulary described.
If any constraints expressed in the prose are unenforceable in
RNG notation, write the schema either to
overgenerate (i.e. to accept all good documents and fail to reject
some bad ones), or else to undergenerate (i.e. to reject
all bad ones, at the cost of failing to accept some good ones).
The next assignment is our chance for a mid-course correction.
Take stock.
What modeling-related topics or questions do you most
need or want practice and help with,
over the next six or seven weeks?
What assignments would you assign yourself,
if you had someone to give you feedback on them?
Assignments, background
We defined a vocabulary consisting of the following elements,
which obey the constraints indicated.
- <doc> is the outermost element; it contains
a title, an optional copyright statement, a sequence of
zero or more paragraph-level elements, and
a sequence of zero or more sections (<div> elements),
in that order.
- <title> is the document title; it contains
text (i.e. data characters) with phrase-level markup.
- <copyright> is the copyright statement; it contains
a sequence of one or more paragraphs.
- <div> is a section; it contains
a title, a sequence of
zero or more paragraph-level elements, and
a sequence of zero or more sections (<div> elements),
in that order.
(continued ...)
Assignments, background (cont'd)
Some elements are described here as being
paragraph-level
elements; that means they can occur at the same level
as paragraphs, in sections and so on. (Sometimes we say
“they occur between paragraphs”.)
- <p> is a paragraph. It can always contain
text and phrase-level markup. As a child of <doc> it can
also contain notes and lists; as a child of <note> it
can contain lists, but not notes.
- <note> is a note. It contains a sequence
of paragraph-level elements.
- <list> is an itemized list. It contains a sequence
of <item> elements.
- <item> is a list item. You may choose whether
it contains a sequence of paragraphs,
or a sequence of paragraph-level elements.
(N.B. <item> is not a paragraph-level element;
it's listed here to be close to <list>.)
(continued ...)
Assignments, background (cont'd)
The phrase-level elements are:
- <emph> (for rhetorical emphasis)
- <term> (for technical terms)
- <cited> (for titles of books and articles cited)
- <ital> (for italics not otherwise accounted for)
- <bold> (for bold not otherwise accounted for)
All phrase-level elements can contain character data and
phrase-level elements.