XML Schema (XSD) 1.1

What's new?

C. M. Sperberg-McQueen, Black Mesa Technologies

http://www.blackmesatech.com/2009/07/xsd11

Summer XML 2009, 27 July 2009

Overview and introduction

Overview

Introduction
- Background
- Classes of change in XSD 1.1
Some new features in XSD 1.1
- Datatypes changes
- Co-occurrence constraints
- Other declaration changes
- Versioning XSD itself
Deploying XSD 1.1

Background

XSD 1.0 Recommendation: May 2001
XSD 1.0 Second Edition: October 2004
What we expected:
- minor tweaks
- some additional functionality
What happened ...

Classes of change in XSD 1.1

Bug fixes
Editorial improvements
Conceptual clarifications, simplifications
Alignment with other specs
Improved functionality, convenience
- including support for easier versioning, extensibility, backward compatibility

Editorial changes

new name!
new terminology
reorganized XML mappings
if vs. if and only if
must ... be vs. must ... must
sentence-by-sentence revisions
... and hundreds more.

Conceptual clarifications

restriction rules simplified
post-schema-validation info set (PSVI)
validation rules and terminology
conformance levels
schema composition terminology
... and more

Nine specific changes to look for

New datatypes (and other Datatypes changes)
Changes to the ID datatype
Conditional type assignment
Assertions
Weakened wildcards
Open content
Changes to substitution groups
XSD versioning
Lax fallback

Datatypes changes

Clarifications
- (Arithmetic) equality vs. identity
- Tighter value space / lexical space story
New datatypes:
- anyAtomicType
- yearMonthDuration
- dayTimeDuration
- dateTimeStamp
- precisionDecimal
Implementation-defined primitives
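
A minimal sketch of how the new duration and timestamp types might appear in a declaration. The element and attribute names (subscription, term, grace, created) are hypothetical, chosen only for illustration:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="subscription">
    <xsd:complexType>
      <!-- years and months only, e.g. P1Y6M -->
      <xsd:attribute name="term" type="xsd:yearMonthDuration"/>
      <!-- days, hours, minutes, seconds only, e.g. P3DT12H -->
      <xsd:attribute name="grace" type="xsd:dayTimeDuration"/>
      <!-- dateTime with a mandatory time zone, e.g. 2009-07-27T09:00:00Z -->
      <xsd:attribute name="created" type="xsd:dateTimeStamp"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```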

Numbers and significant digits

Is 5 = 5.00?

In normal engineering practice, no: one significant digit vs. three.

In xsd:decimal, yes: equal and identical.

In xsd:precisionDecimal, 5 and 5.00 are equal, but not identical.

Precision decimal

New xsd:precisionDecimal datatype: like xsd:decimal, but tracks significant digits. To require measurement to the nearest hundredth:

  <xsd:simpleType name="measurement">
    <xsd:restriction base="xsd:precisionDecimal">
      <xsd:minScale value="2"/>
      <xsd:maxScale value="2"/>
    </xsd:restriction>
  </xsd:simpleType>

The `explicitTimezone` facet

Time zone can be required:

  <xsd:simpleType name="dateTimeStamp">
    <xsd:restriction base="xsd:dateTime">
      <xsd:explicitTimezone fixed="true"
        value="required"/>
    </xsd:restriction>
  </xsd:simpleType>

N.B. you don't need to define this one: it's built in.

The `explicitTimezone` facet (2)

Time zone can be forbidden:

<xsd:simpleType name='bare-date'>
  <xsd:restriction base='xsd:date'>
    <xsd:explicitTimezone value='prohibited'/>
  </xsd:restriction>
</xsd:simpleType>

Changes to the ID datatype

Now legal:

lists of xsd:ID
unions of xsd:ID and other types
default and fixed xsd:ID values*
multiple xsd:ID attributes on same element

(N.B. multiple ID elements were already legal.)
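
As a sketch, the first two items in the list above could be written as follows. The type names idList and idOrNumber are made up for the example:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- A list whose items are IDs -->
  <xsd:simpleType name="idList">
    <xsd:list itemType="xsd:ID"/>
  </xsd:simpleType>

  <!-- A union of ID with another type -->
  <xsd:simpleType name="idOrNumber">
    <xsd:union memberTypes="xsd:ID xsd:nonNegativeInteger"/>
  </xsd:simpleType>
</xsd:schema>
```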

Supporting `xml:id` with XSD 1.1

So you can now write:

 <xsd:complexType name="...">
  <!--* ... *-->
  <xsd:attribute name="id" type="xsd:ID"/>
  <xsd:attribute ref="xml:id"/>
 </xsd:complexType>

Q: How do I ... ?

Common questions:

“How do I say that if xml:lang="ja", the element has type my:Japanese-prose, and otherwise my:Western-prose?”
“How do I say that if attribute a is "v1", then b must not be any of v2, v3, v4?”
“How do I require that the count attribute give the correct number of children?”
etc.

A: Co-occurrence constraints

Two forms:

conditional type assignment (CTA)
assertions (check clauses)

Both involve evaluation of XPath expressions.

Conditional type assignment

Based on work by Fabio Vitali et al.

Normal type assignment: one declared type for each element.
- Element E has type T.
Conditional type assignment: sequence of test + type pairs.
- Element E has ...
- If 〈test 1〉 then type T1
- else if 〈test 2〉 then type T2
- ...
- else if 〈test n〉 then type Tn
- else type T

Supporting Atom message types with XSD 1.1

 
<xsd:element name="message" type="messageType">
  <xsd:alternative test="@kind='string'" 
                   type="messageTypeString"/>
  <xsd:alternative test="@kind='base64'" 
                   type="messageTypeBase64"/>
  <xsd:alternative test="@kind='binary'" 
                   type="messageTypeBase64"/>
  <xsd:alternative test="@kind='xml'"    
                   type="messageTypeXML"/>
  <xsd:alternative test="@kind='XML'"    
                   type="messageTypeXML"/>
</xsd:element>
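
The final "else type T" branch can also be written explicitly. A sketch, assuming the same hypothetical message types as above: an xsd:alternative with no test attribute acts as the default, and in the final XSD 1.1 Recommendation the built-in type xsd:error (which has no valid instances) can be used there to reject any unlisted kind:

```xml
<xsd:element name="message" type="messageType">
  <xsd:alternative test="@kind='string'"
                   type="messageTypeString"/>
  <xsd:alternative test="@kind='base64'"
                   type="messageTypeBase64"/>
  <!-- No test: this is the 'else' branch and must come last.
       Any other value of kind makes the element invalid. -->
  <xsd:alternative type="xsd:error"/>
</xsd:element>
```

Without such a default alternative, an element that matches none of the tests simply gets the declared type (here messageType).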

Internationalizing your prose

Conditional type assignment works with xml:lang:

 
<xsd:element name="para" type="my:proseType">
  <xsd:alternative test="@xml:lang='ja'" 
                   type="my:prose_japanese"/>
  <xsd:alternative test="@xml:lang='ar'" 
                   type="my:prose_arabic"/>
  <xsd:alternative test="@xml:lang='he'" 
                   type="my:prose_hebrew"/>
</xsd:element>

Restrictions on CTA tests

Tests can refer to

constants
attributes on the element itself

but not to

ancestors, siblings, children, descendants
attributes on the above

So context-independence and pre-order type assignment are preserved.

Inherited attributes

But wait! xml:lang isn't always specified: mostly it's inherited!

If I can't refer to my ancestors, how does this work?!

Supporting `xml:lang` with XSD 1.1

 
 <xsd:attribute name="lang" inheritable="true">
  <!--* ... *-->
  <xsd:simpleType>
   <xsd:union memberTypes="xsd:language">
    <xsd:simpleType>    
     <xsd:restriction base="xsd:string">
      <xsd:enumeration value=""/>
     </xsd:restriction>
    </xsd:simpleType>
   </xsd:union>
  </xsd:simpleType>
 </xsd:attribute>
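
A sketch of what this buys you in an instance, assuming xml:lang is declared inheritable as above and para carries the language alternatives shown earlier. The section element is hypothetical:

```xml
<section xml:lang="ja">
  <!-- No xml:lang here, but the inherited value 'ja' is visible
       to the test @xml:lang='ja', so this para is Japanese prose. -->
  <para>...</para>

  <!-- A local value overrides the inherited one; none of the
       alternatives match, so the declared type applies. -->
  <para xml:lang="en">...</para>
</section>
```

The test still mentions only an attribute of the element itself; the processor supplies the inherited value, so the restriction on CTA tests is not violated.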

Assertions

Like SQL check clauses, a way to specify additional constraints in the form of query predicates.

Cf. also Schematron.

<xsd:complexType name="intRange">
 <xsd:attribute name="min" type="xsd:int"/>
 <xsd:attribute name="max" type="xsd:int"/>
 <xsd:assert test="@min le @max"/>
</xsd:complexType>

Checking the `count` attribute

<xsd:complexType name="arrayType">
 <xsd:sequence>
  <xsd:element name="entry" 
    minOccurs="0" maxOccurs="unbounded"/>
 </xsd:sequence>
 <xsd:attribute name="length" type="xsd:int"/>
 <xsd:assert test="@length eq fn:count(./entry)"/>
</xsd:complexType>
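
The second of the common questions ("if attribute a is v1, then b must not be any of v2, v3, v4") can be handled the same way. A minimal sketch; the type name abType and the use of xsd:string for both attributes are assumptions:

```xml
<xsd:complexType name="abType">
 <xsd:attribute name="a" type="xsd:string"/>
 <xsd:attribute name="b" type="xsd:string"/>
 <!-- Either a is not 'v1', or b is none of the listed values.
      An absent attribute makes its comparison false. -->
 <xsd:assert test="not(@a = 'v1') or not(@b = ('v2', 'v3', 'v4'))"/>
</xsd:complexType>
```

The general comparison against the sequence ('v2', 'v3', 'v4') is XPath 2.0 syntax, which is what assertions use.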

XPath evaluation

XPath expressions as predicates:
- True → OK
- False → not valid
- XPath error → not valid
Evaluated on a subtree.
Full XPath is legal; for CTA, implementations may support a subset.
Attributes, descendants are typed; element itself is not typed.

CTA vs. Assertions (vs. Schematron)

CTA:
- element - type binding
- very restricted XDM (attributes only)
Assertions:
- part of type
- XDM has (typed) subtree
Schematron:
- typically element-based
- may point anywhere in the document
- not type-aware*
- streamable?

Revision to UPA

The ‘unique particle attribution’ rule (aka ‘determinism’ rule) has changed.

Informally: when you match input against a content model, you must know, without lookahead, which particle in the model matches each input item.

(a?, b?, c?) // deterministic

(a, a?) // deterministic

(a?, a) // non-deterministic

(a+, (b | c)*, d+)* // (strongly) deterministic

(a+, (b | c)*, d*)* // (weakly) deterministic

(my:a | #ANY)* // I'm glad you asked ...

Competition

paths: For any model M, every sentence in L(M) traces a path in M.

E.g. the path of “a a b d a d” in (a+, (b | c)*, d+)*.

competition: Two particles P₁ and P₂ compete when some input has two paths Q₁, Q₂ in M that differ only in the last item:

Q₁ = Q₀ + < P₁ >
Q₂ = Q₀ + < P₂ >

ambiguity: If some sequence S in L(M) has two paths in M, then M is ambiguous.

determinism: Stronger than non-ambiguity: no S has two paths in M, even if S ∉ L(M).

New UPA rule

(1) M satisfies new UPA iff:

No two element particles in M compete.
No two wildcard particles in M compete.

(2) A validation path uses element particles, not wildcards, when elements and wildcards compete. Affects particle matching, type assignment.

S locally valid iff S has validation path (not just any path) in M. N.B. V(M) ⊆ L(M).

N.B. for some M, V(M) ≠ L(M): e.g. (#any?, a) or ((#any, x) | (a, b)).
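
To make the last remark concrete, here is a sketch of the first model, (#any?, a), as a schema fragment. The prefix tns and the element name a are placeholders, and the comments follow rule (2) as stated above:

```xml
<xsd:sequence>
  <xsd:any minOccurs="0" processContents="lax"/>
  <xsd:element ref="tns:a"/>
</xsd:sequence>
<!-- Input: a
     The wildcard and the element particle compete for the item;
     the element particle wins, and the sequence is valid.

     Input: a a
     This is in L(M): the wildcard could take the first a.
     But the element particle must be used for the first a,
     leaving nothing to match the second. There is no
     validation path, so the input is not in V(M). -->
```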

Weak wildcards in XSD 1.1 (basic structure)

Simple-minded design for personName:

<personName xmlns="http://example.com/ns">
  <given>Dave</given>
  <surname>Orchard</surname>
</personName>

The declaration:

 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:element ref="tns:given"/>
   <xsd:element ref="tns:surname"/>
  </xsd:sequence>
 </xsd:complexType>

Weak wildcards 2 (planning extensibility)

But suppose

We expect we may revise this schema.
We want 1.0 handlers to accept 2.0 messages.
We expect that 1.0 handlers
- accept valid 1.0 messages
- reject all others
N.B. this does not go without saying. No law says you cannot process invalid data.

Weak wildcards 3 (worrying about 2.0)

We worry about 2.0. What if 2.0 wants to allow

<personName xmlns="http://example.com/ns">
  <given>Dave</given>
  <middle>Bryce</middle>
  <surname>Orchard</surname>
</personName>

Weak wildcards 4 (1.0 wildcards)

We'd like to write

 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

But this violates UPA. (Why?)

Weak wildcards 5 (1.1 wildcards)

In XSD 1.1, this example is legal:

 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

Weak wildcards 6 (a 1.1 gotcha)

But in XSD 1.1, this example is still illegal:

 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:middle" minOccurs="0"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

(Why?)

Weak wildcards 7 (a legal 1.1 formulation)

A standard technique for solving this kind of UPA violation:

 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:sequence minOccurs="0">
     <xsd:element ref="tns:middle"/>
     <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   </xsd:sequence>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

Open content: the problem

All we wanted to say was:

We want

a given element
followed optionally by a middle
followed by a surname
with anything at all allowed before, between, and after.

That is, what some schema languages call ‘open content’.

Why is that so hard?

Open content: the feature

OK. Old version:

 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:sequence minOccurs="0">
     <xsd:element ref="tns:middle"/>
     <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   </xsd:sequence>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

Open content: the feature (2)

OK. New version:

 <xsd:complexType name="personname">
  <xsd:openContent mode="interleave">
   <xsd:any/>
  </xsd:openContent>
  <xsd:sequence>
   <xsd:element ref="tns:given"/>
   <xsd:element ref="tns:middle" minOccurs="0"/>
   <xsd:element ref="tns:surname"/>
  </xsd:sequence>
 </xsd:complexType>

An xsd:defaultOpenContent element can provide a default open-content wildcard for all types in a schema document.
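
A sketch of the schema-wide default, assuming the example.com namespace used in the personName instance; the choice of lax processing is illustrative:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:tns="http://example.com/ns"
            targetNamespace="http://example.com/ns"
            elementFormDefault="qualified">

  <!-- Applies to every complex type in this schema document
       that does not specify its own open content. -->
  <xsd:defaultOpenContent mode="interleave">
    <xsd:any processContents="lax"/>
  </xsd:defaultOpenContent>

  <xsd:element name="given" type="xsd:string"/>
  <xsd:element name="middle" type="xsd:string"/>
  <xsd:element name="surname" type="xsd:string"/>

  <xsd:complexType name="personname">
    <xsd:sequence>
      <xsd:element ref="tns:given"/>
      <xsd:element ref="tns:middle" minOccurs="0"/>
      <xsd:element ref="tns:surname"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

With mode="suffix" instead of "interleave", the extra elements would be allowed only after surname.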

Other wildcard changes

Other enhancements to wildcards:

negative wildcards (exclude certain namespaces, certain QNames)
not-my-sibling wildcards
not-in-schema wildcards (BEWARE!)
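
A sketch of how these three variants are spelled, using the notNamespace and notQName attributes on xsd:any. The type names and the tns prefix are hypothetical:

```xml
<!-- Negative wildcard: anything outside the target namespace,
     and never an element named tns:given -->
<xsd:complexType name="extA">
  <xsd:sequence>
    <xsd:any notNamespace="##targetNamespace"
             notQName="tns:given"
             minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>

<!-- Not-my-sibling: anything except elements declared
     elsewhere in this same content model -->
<xsd:complexType name="extB">
  <xsd:sequence>
    <xsd:element ref="tns:given"/>
    <xsd:any notQName="##definedSibling"
             minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>

<!-- Not-in-schema: anything with no global declaration in the
     schema. What matches depends on which components happen
     to be loaded, hence the warning above. -->
<xsd:complexType name="extC">
  <xsd:sequence>
    <xsd:any notQName="##defined"
             minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
```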

Changes to substitution groups

Multiple substitution group heads
Abstract elements in substitution groups

Makes common document architectures much easier.

But N.B. UPA can still be a problem.

Using substitution groups for extensibility

It's easy to make a vocabulary extensible:

 <xsd:element name="regex" 
                 type="my:xsd_regex" 
                 substitutionGroup="tns:phrase tns:code"/>
 <xsd:element name="formula" 
                 type="my:fopc" 
                 substitutionGroup="tns:display tns:chunk"/>

XSD versioning

xsd:schema now has open content
block="#all" is gone
identifiers for versions of XSD (coexistence)
implementation-defined primitives
- implementations can allow user-defined primitives
version-control (vc:*) attributes

Version-control attributes

Every element in a schema document can have version-control attributes:

vc:minVersion, vc:maxVersion: What version of XSD does the validator support?
vc:typeAvailable, vc:typeUnavailable: Are these types built-in?
vc:facetAvailable, vc:facetUnavailable: Does the processor understand these facets?

Expected usage: test for implementation-defined extensions.
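
The version attributes can also select between whole declarations. A sketch, reusing the intRange type from the assertions slide: a processor keeps an element only if its supported version is at least vc:minVersion and strictly below vc:maxVersion.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning">

  <!-- Kept by processors that support XSD 1.1 or later -->
  <xsd:complexType name="intRange" vc:minVersion="1.1">
    <xsd:attribute name="min" type="xsd:int"/>
    <xsd:attribute name="max" type="xsd:int"/>
    <xsd:assert test="@min le @max"/>
  </xsd:complexType>

  <!-- Kept by processors whose supported version is below 1.1 -->
  <xsd:complexType name="intRange" vc:maxVersion="1.1">
    <xsd:attribute name="min" type="xsd:int"/>
    <xsd:attribute name="max" type="xsd:int"/>
  </xsd:complexType>

</xsd:schema>
```

A processor that ignores the vc attributes sees two declarations of intRange and reports an error, so this only works with processors that honor them.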

Using version-control in schema documents

We want to use precision decimal if we can, or double otherwise:

 <xsd:element vc:typeAvailable="xsd:precisionDecimal"
              name="datum" 
              type="xsd:precisionDecimal"  />
 <xsd:element vc:typeUnavailable="xsd:precisionDecimal"
              name="datum" 
              type="xsd:double" />

XSD 1.0 processors can (and should) retrofit.

An online demo exists.

Lax fallback

What happens when a type is missing?

In 1.0:

either skip
or ‘fall back to lax assessment’ (with xsd:anyType)

In 1.1:

fall back to lax assessment with xsd:anyType (always)

May affect throughput.

Better interoperability.

Lax fallback for `xsi:type`

What happens when an instance-specified type is missing?

In 1.0: fail.

But we know it's either

a restriction of declared type
an extension of declared type

In 1.1: fall back to declared type.

Deploying XSD 1.1

Many changes are intended to empower users:

named subsets of PSVI
new names for conformance levels
error detection required
implementation-defined primitives, facets, built-ins
schema composition terminology
invocation terminology
choice of XML 1.0 or XML 1.1
implementation-defined, implementation-dependent

Implementation-defined, implementation-dependent

Following XQuery, XSLT 2.0, and SQL, XSD 1.1 distinguishes

implementation-defined: may vary from vendor to vendor. Must be documented.
implementation-dependent: may vary from moment to moment. Need not be documented (and in fact, documentation discouraged).

Checklist of implementation-defined features

Full list in the specs. Highlights:

Use of XML 1.0 or XML 1.1 name rules.
Read schema documents (or hard-coded schema?)
Web-aware?
Possible methods of invocation.
What parts of PSVI available to user? How?
Schema composition policy.
Additional primitives?
Implementation limits (max integer, etc.)

XML 1.0 or XML 1.1?

Use XML 1.0 or XML 1.1 definition of NCName?

Conforming implementations of this specification may provide either the 1.1-based datatypes or the 1.0-based datatypes, or both. If both are supported, the choice of which datatypes to use in a particular assessment episode should be under user control.

N.B. Some vendors would prefer not to give users the choice.

The PSVI

The PSVI is an abstraction, not an API or data format.

How much of it does your processor expose? How?

How much of it do you need? In what kind of form?

Named PSVI subsets

Some points of reference:

root-validity (valid? validation attempted? error?)
instance-validity (as above, for each element and attribute)
type-aware (as above, plus particle, declaration, type definition, etc.)
lightweight type-aware (as above, using names of types, not full type info)
full instance (everything but the components)
full PSVI with components (everything including the components)

Conformance levels

Old	New
minimally conforming	minimally conforming
in conformance to the XML Representation of Schema	schema-document aware
fully conforming	Web-aware

Starting schema-validity assessment

type-driven: start here, using type definition foo
element-driven: start here, using element declaration bar
attribute-driven: start here, using attribute declaration baz
lax-wildcard validation: start here, as if matching a lax wildcard (i.e. either find the declaration or fall back to lax processing).

Missing declaration? Not a problem.
strict-wildcard validation: start here, as if matching a strict wildcard (i.e. either find the declaration or fall back to lax processing).*

Missing declaration? It's the end of the world.*

Schema composition terminology

Want to know a secret?

Processors can do what they like.

Where does the processor look?
Methods of indirection
What to use in indirect lookup
When to stop
Reacting to failure

Where to look

Where do schemas come from?

hard-coded schemas
automatically known components
hard-coded schema locations
named pairs (run-time options)
user-specified schema documents
interactive inquiry
namespace name
schemaLocation hints in the XML instance
schemaLocation hints in schema documents
local repository

Indirection

User control over where to look?

path indirection / search path
URI indirection
catalogs
local repository
recursion (multi-step lookup)
non-recursion

What's the key?

Indirection looks things up. What does it look up?

namespace name
location
other ... ?

Stopping

When to stop?

stop after first success
consult all locations / resources

What if it's not there?

error
continue

Why all these terms?

The new terminology allows

vendors — to say what their software does
users — to say what they want their software to do
other specs — to say what schema validators must do, to work with their spec

Design assumption: you have power, you can negotiate.

Take charge!