
XML Schema (XSD) 1.1

What's new?

C. M. Sperberg-McQueen, Black Mesa Technologies

http://www.blackmesatech.com/2009/07/xsd11

Summer XML 2009, 27 July 2009


Overview and introduction

Overview

Background

Classes of change in XSD 1.1

Editorial changes

Conceptual clarifications

Nine specific changes to look for

Datatypes changes

Numbers and significant digits

Is 5 = 5.00?
In normal engineering practice, no: one significant digit vs. three.
In xsd:decimal, yes: equal and identical.
In xsd:precisionDecimal, 5 and 5.00 are equal, but not identical.

Precision decimal

New xsd:precisionDecimal datatype: like xsd:decimal, but tracks significant digits. To require measurement to the nearest hundredth:
  <xsd:simpleType name="measurement">
    <xsd:restriction base="xsd:precisionDecimal">
      <xsd:minScale value="2"/>
      <xsd:maxScale value="2"/>
    </xsd:restriction>
  </xsd:simpleType>

The explicitTimezone facet

Time zone can be required:
  <xsd:simpleType name="dateTimeStamp">
    <xsd:restriction base="xsd:dateTime">
      <xsd:explicitTimezone fixed="true"
        value="required"/>
    </xsd:restriction>
  </xsd:simpleType>
N.B. you don't need to define this one: it's built in.

The explicitTimezone facet (2)

Time zone can be forbidden:
<xsd:simpleType name='bare-date'>
  <xsd:restriction base='xsd:date'>
    <xsd:explicitTimezone value='prohibited'/>
  </xsd:restriction>
</xsd:simpleType>

Changes to the ID datatype

Now legal:
  • lists of xsd:ID
  • unions of xsd:ID and other types
  • default and fixed xsd:ID values*
  • multiple xsd:ID attributes on same element
    (N.B. multiple ID elements were already legal.)
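A minimal sketch of the first and last items. The names idList, thing, id and aliases are hypothetical, chosen only for illustration:
```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <!-- a whitespace-separated list of IDs (not allowed in 1.0) -->
 <xsd:simpleType name="idList">
  <xsd:list itemType="xsd:ID"/>
 </xsd:simpleType>
 <xsd:complexType name="thing">
  <!-- two ID-typed attributes on the same element -->
  <xsd:attribute name="id" type="xsd:ID"/>
  <xsd:attribute name="aliases" type="idList"/>
 </xsd:complexType>
</xsd:schema>
```
ID values must still be unique within the document, whichever attribute carries them.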

Supporting xml:id with XSD 1.1

So you can now write:
 <xsd:complexType name="...">
  <!--* ... *-->
  <xsd:attribute name="id" type="xsd:ID"/>
  <xsd:attribute ref="xml:id"/>
 </xsd:complexType>
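The reference to xml:id resolves only if the schema document imports the XML namespace. A sketch, assuming the W3C-hosted xml.xsd serves as the location hint; the type name labelled is hypothetical:
```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <!-- the xml prefix is predeclared; only the import is needed -->
 <xsd:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>
 <xsd:complexType name="labelled">
  <xsd:attribute name="id" type="xsd:ID"/>
  <xsd:attribute ref="xml:id"/>
 </xsd:complexType>
</xsd:schema>
```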

Q: How do I ... ?

Common questions:
  • “How do I say that if xml:lang="ja", the element has type my:Japanese-prose, and otherwise my:Western-prose?”
  • “How do I say that if attribute a is "v1", then b must not be any of v2, v3, v4?”
  • “How do I require that the count attribute give the correct number of children?”
  • etc.

A: Co-occurrence constraints

Two forms:
  • conditional type assignment (CTA)
  • assertions (check clauses)
Both involve evaluation of XPath expressions.

Conditional type assignment

Based on work by Fabio Vitali et al.
  • Normal type assignment: one declared type for each element.
    • Element E has type T.
  • Conditional type assignment: sequence of test + type pairs.
    • Element E has ...
    • If 〈test 1〉 then type T1
    • else if 〈test 2〉 then type T2
    • ...
    • else if 〈test n〉 then type Tn
    • else type T

Supporting Atom message types with XSD 1.1

 
<xsd:element name="message" type="messageType">
  <xsd:alternative test="@kind='string'" 
                   type="messageTypeString"/>
  <xsd:alternative test="@kind='base64'" 
                   type="messageTypeBase64"/>
  <xsd:alternative test="@kind='binary'" 
                   type="messageTypeBase64"/>
  <xsd:alternative test="@kind='xml'"    
                   type="messageTypeXML"/>
  <xsd:alternative test="@kind='XML'"    
                   type="messageTypeXML"/>
</xsd:element>
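An instance sketch showing which type each message would receive under the declaration above. The messages wrapper and the element content are hypothetical; whether the content is valid depends on type definitions the slide does not show:
```xml
<messages>
 <!-- @kind='string' is true: messageTypeString -->
 <message kind="string">hello</message>
 <!-- @kind='binary' is true: messageTypeBase64 -->
 <message kind="binary">aGVsbG8=</message>
 <!-- no test is true: the declared type, messageType, applies -->
 <message kind="other">hello</message>
</messages>
```
Alternatives are tried in document order, and the first test that is true selects the type.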

Internationalizing your prose

Conditional type assignment works with xml:lang:
 
<xsd:element name="para" type="my:proseType">
  <xsd:alternative test="@xml:lang='ja'" 
                   type="my:prose_japanese"/>
  <xsd:alternative test="@xml:lang='ar'" 
                   type="my:prose_arabic"/>
  <xsd:alternative test="@xml:lang='he'" 
                   type="my:prose_hebrew"/>
</xsd:element>

Restrictions on CTA tests

Tests can refer to
  • constants
  • attributes on the element itself
but not to
  • ancestors, siblings, children, descendants
  • attributes on the above
So context-independence and pre-order type assignment are preserved.

Inherited attributes

But wait! xml:lang isn't always specified: mostly it's inherited!
If I can't refer to my ancestors, how does this work?!

Supporting xml:lang with XSD 1.1

 
 <xsd:attribute name="lang" inheritable="true">
  <!--* ... *-->
  <xsd:simpleType>
   <xsd:union memberTypes="xsd:language">
    <xsd:simpleType>    
     <xsd:restriction base="xsd:string">
      <xsd:enumeration value=""/>
     </xsd:restriction>
    </xsd:simpleType>
   </xsd:union>
  </xsd:simpleType>
 </xsd:attribute>
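With xml:lang declared inheritable, a CTA test on a descendant should see the inherited value as if it were the element's own attribute. An instance sketch, assuming the para declaration from the earlier slide; the doc and section elements are hypothetical:
```xml
<doc xml:lang="ja">
 <!-- no xml:lang of its own: inherits 'ja', so @xml:lang='ja' holds -->
 <para>...</para>
 <!-- an explicit value takes precedence over the inherited one -->
 <para xml:lang="he">...</para>
 <section>
  <!-- inheritance passes through elements that do not set xml:lang -->
  <para>...</para>
 </section>
</doc>
```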

Assertions

Like SQL check clauses, a way to specify additional constraints in the form of query predicates.
Cf. also Schematron.
<xsd:complexType name="intRange">
 <xsd:attribute name="min" type="xsd:int"/>
 <xsd:attribute name="max" type="xsd:int"/>
 <xsd:assert test="@min le @max"/>
</xsd:complexType>
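The earlier question about attributes a and b can be answered the same way. A sketch, with a hypothetical type name abType and both attributes assumed to be plain strings:
```xml
<xsd:complexType name="abType">
 <xsd:attribute name="a" type="xsd:string"/>
 <xsd:attribute name="b" type="xsd:string"/>
 <!-- if a is v1, then b must be none of v2, v3, v4 -->
 <xsd:assert test="not(@a = 'v1' and @b = ('v2', 'v3', 'v4'))"/>
</xsd:complexType>
```
The general comparison @b = ('v2', 'v3', 'v4') is true if @b equals any item in the sequence; if either attribute is absent, the conjunction is false and the assertion holds.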

Checking the count attribute

<xsd:complexType name="arrayType">
 <xsd:sequence>
  <xsd:element name="entry" 
    minOccurs="0" maxOccurs="unbounded"/>
 </xsd:sequence>
 <xsd:attribute name="length" type="xsd:int"/>
 <xsd:assert test="@length eq fn:count(./entry)"/>
</xsd:complexType>
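An instance sketch of how the assertion plays out, assuming a hypothetical element array declared with type arrayType and a hypothetical arrays wrapper:
```xml
<arrays>
 <!-- length equals the number of entry children: assertion true -->
 <array length="2"><entry/><entry/></array>
 <!-- length says 3 but there is one entry: assertion false, not valid -->
 <array length="3"><entry/></array>
</arrays>
```
Since length is optional in the type, an array without it would also fail: comparing an empty sequence with eq yields the empty sequence, which counts as false.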

XPath evaluation

  • XPath expressions as predicates:
    • True → OK
    • False → not valid
    • XPath error → not valid
  • Evaluated on a subtree.
  • Full XPath is legal; for CTA, implementations may support a subset.
  • Attributes, descendants are typed; element itself is not typed.

CTA vs. Assertions (vs. Schematron)

Revision to UPA

The ‘unique particle attribution’ rule (aka ‘determinism’ rule) has changed.
Informally: when matching input against a content model, you must know, without lookahead, which particle each input item matches.
(a?, b?, c?)         // deterministic
(a, a?)              // deterministic
(a?, a)              // non-deterministic
(a+, (b | c)*, d+)*  // (strongly) deterministic
(a+, (b | c)*, d*)*  // (weakly) deterministic
(my:a | #ANY)*       // I'm glad you asked ...
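In schema-document terms, the non-deterministic model (a?, a) might look like this. A sketch; the type name and the use of xsd:string are hypothetical:
```xml
<xsd:complexType name="optionalThenRequired">
 <xsd:sequence>
  <!-- on seeing the first a, a validator cannot tell without
       lookahead whether it matches this optional particle ... -->
  <xsd:element name="a" type="xsd:string" minOccurs="0"/>
  <!-- ... or this required one -->
  <xsd:element name="a" type="xsd:string"/>
 </xsd:sequence>
</xsd:complexType>
```
Swapping the two particles gives (a, a?), which accepts the same sequences and is deterministic.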

Competition

paths: For any model M, every sentence in L(M) traces a path in M.
E.g. the path of “a a b d a d” in (a+, (b | c)*, d+)*.
competition: Two particles P1 and P2 compete when some input has two paths Q1, Q2 in M that differ only in their last item (P1 in one, P2 in the other).
ambiguity: If some sequence S in L(M) has two paths in M, then M is ambiguous.
determinism: Stronger than non-ambiguity: no S has two paths in M, even if S ∉ L(M).

New UPA rule

(1) M satisfies new UPA iff:
  • No two element particles in M compete.
  • No two wildcard particles in M compete.
(2) A validation path uses element particles, not wildcards, when elements and wildcards compete. Affects particle matching, type assignment.
S locally valid iff S has validation path (not just any path) in M. N.B. V(M) ⊆ L(M).
N.B. for some M, V(M) ≠ L(M): e.g. (#any?, a) or ((#any, x) | (a, b)).
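A sketch of the first example, (#any?, a), as a complex type, to show where V(M) and L(M) part ways. The type name is hypothetical, and the walkthrough is one reading of the rule rather than a quotation from the spec:
```xml
<xsd:complexType name="anyThenA">
 <xsd:sequence>
  <xsd:any minOccurs="0" processContents="lax"/>
  <xsd:element name="a" type="xsd:string"/>
 </xsd:sequence>
</xsd:complexType>
```
The sequence "x a" has a validation path: x can only match the wildcard. The sequence "a a" is in L(M) via (wildcard, a), but on the first a the wildcard and the element particle compete, the element particle must be used, and nothing is left to match the second a. So "a a" has a path but no validation path.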

Weak wildcards in XSD 1.1 (basic structure)

Simple-minded design for personName:
<personName xmlns="http://example.com/ns">
  <given>Dave</given>
  <surname>Orchard</surname>
</personName>
The declaration:
 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:element ref="tns:given"/>
   <xsd:element ref="tns:surname"/>
  </xsd:sequence>
 </xsd:complexType>

Weak wildcards 2 (planning extensibility)

But suppose
  • We expect we may revise this schema.
  • We want 1.0 handlers to accept 2.0 messages.
  • We expect that 1.0 handlers
    • accept valid 1.0 messages
    • reject all others
    N.B. this does not go without saying. No law says you cannot process invalid data.

Weak wildcards 3 (worrying about 2.0)

  • We worry about 2.0. What if 2.0 wants to allow
    <personName xmlns="http://example.com/ns">
      <given>Dave</given>
      <middle>Bryce</middle>
      <surname>Orchard</surname>
    </personName>
    ?

Weak wildcards 4 (1.0 wildcards)

We'd like to write
 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>
But this violates UPA. (Why?)

Weak wildcards 5 (1.1 wildcards)

In XSD 1.1, this example is legal:
 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

Weak wildcards 6 (a 1.1 gotcha)

But in XSD 1.1, this example is still illegal:
 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:middle" minOccurs="0"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>
(Why?)

Weak wildcards 7 (a legal 1.1 formulation)

A standard technique for solving this kind of UPA violation:
 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:sequence minOccurs="0">
     <xsd:element ref="tns:middle"/>
     <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   </xsd:sequence>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

Open content: the problem

All we wanted to say was:
We want
  • a given element
  • followed optionally by a middle
  • followed by a surname
  • with anything at all allowed before, between, and after.
That is, what some schema languages call ‘open content’.
Why is that so hard?

Open content: the feature

OK. Old version:
 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:element ref="tns:given"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   <xsd:sequence minOccurs="0">
     <xsd:element ref="tns:middle"/>
     <xsd:any minOccurs="0" maxOccurs="unbounded"/>
   </xsd:sequence>
   <xsd:element ref="tns:surname"/>
   <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

Open content: the feature (2)

OK. New version:
 <xsd:complexType name="personname">
  <xsd:openContent>
   <xsd:any/>
  </xsd:openContent>
  <xsd:sequence>
   <xsd:element ref="tns:given"/>
   <xsd:element ref="tns:middle" minOccurs="0"/>
   <xsd:element ref="tns:surname"/>
  </xsd:sequence>
 </xsd:complexType>
An xsd:defaultOpenContent element can provide a default open-content wildcard for all types in a schema document.
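A sketch of the schema-wide default, assuming xsd:defaultOpenContent sits at the top of the schema document and takes a mode attribute, and reusing the http://example.com/ns namespace from the instance examples. The string-typed element declarations are hypothetical:
```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:tns="http://example.com/ns"
            targetNamespace="http://example.com/ns"
            elementFormDefault="qualified">
 <!-- applies to complex types in this document that have no
      xsd:openContent of their own -->
 <xsd:defaultOpenContent mode="interleave">
  <xsd:any processContents="lax"/>
 </xsd:defaultOpenContent>
 <xsd:element name="given" type="xsd:string"/>
 <xsd:element name="middle" type="xsd:string"/>
 <xsd:element name="surname" type="xsd:string"/>
 <xsd:complexType name="personname">
  <xsd:sequence>
   <xsd:element ref="tns:given"/>
   <xsd:element ref="tns:middle" minOccurs="0"/>
   <xsd:element ref="tns:surname"/>
  </xsd:sequence>
 </xsd:complexType>
</xsd:schema>
```
With mode="interleave", extra elements may appear before, between, and after the declared ones; mode="suffix" would allow them only at the end.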

Other wildcard changes

Other enhancements to wildcards:
  • negative wildcards (exclude certain namespaces, certain QNames)
  • not-my-sibling wildcards
  • not-in-schema wildcards (BEWARE!)
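A sketch of how the three kinds might be written, using the notNamespace and notQName attributes. Each wildcard is wrapped in its own hypothetical named group so that the fragment does not itself put two wildcards in competition:
```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:tns="http://example.com/ns"
            targetNamespace="http://example.com/ns">
 <!-- negative wildcard: any element outside the target namespace -->
 <xsd:group name="foreignOnly">
  <xsd:sequence><xsd:any notNamespace="##targetNamespace"/></xsd:sequence>
 </xsd:group>
 <!-- negative wildcard: any element except the two named ones -->
 <xsd:group name="notTheseNames">
  <xsd:sequence><xsd:any notQName="tns:given tns:surname"/></xsd:sequence>
 </xsd:group>
 <!-- not-my-sibling: excludes elements declared in the same content model -->
 <xsd:group name="notMySibling">
  <xsd:sequence><xsd:any notQName="##definedSibling"/></xsd:sequence>
 </xsd:group>
 <!-- not-in-schema: excludes every element with a top-level declaration -->
 <xsd:group name="notInSchema">
  <xsd:sequence><xsd:any notQName="##defined"/></xsd:sequence>
 </xsd:group>
</xsd:schema>
```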

Changes to substitution groups

An element can now belong to more than one substitution group. This makes common document architectures much easier.
But N.B. UPA can still be a problem.

Using substitution groups for extensibility

It's easy to make a vocabulary extensible:
 <xsd:element name="regex"
              type="my:xsd_regex"
              substitutionGroup="tns:phrase tns:code"/>
 <xsd:element name="formula"
              type="my:fopc"
              substitutionGroup="tns:display tns:chunk"/>

XSD versioning

Version-control attributes

Every element in a schema document can have version-control attributes:
  • vc:minVersion, vc:maxVersion: What version of XSD does the validator support?
  • vc:typeAvailable, vc:typeUnavailable: Are these types built-in?
  • vc:facetAvailable, vc:facetUnavailable: Does the processor understand these facets?
Expected usage: test for implementation-defined extensions.
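A sketch of the version attributes in use, assuming the versioning namespace http://www.w3.org/2007/XMLSchema-versioning and reusing the intRange type from the assertions slide. A 1.1 processor keeps the first definition, an older (retrofitted) processor the second:
```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning">
 <!-- kept when the processor supports version 1.1 or later -->
 <xsd:complexType name="intRange" vc:minVersion="1.1">
  <xsd:attribute name="min" type="xsd:int"/>
  <xsd:attribute name="max" type="xsd:int"/>
  <xsd:assert test="@min le @max"/>
 </xsd:complexType>
 <!-- kept when the processor supports only versions below 1.1 -->
 <xsd:complexType name="intRange" vc:maxVersion="1.1">
  <xsd:attribute name="min" type="xsd:int"/>
  <xsd:attribute name="max" type="xsd:int"/>
 </xsd:complexType>
</xsd:schema>
```
The bounds are meant not to overlap: minVersion is inclusive and maxVersion exclusive, so exactly one definition survives for any given processor version.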

Using version-control in schema documents

We want to use precision decimal if we can, or double otherwise:
 <xsd:element vc:typeAvailable="xsd:precisionDecimal"
              name="datum" 
              type="xsd:precisionDecimal"  />
 <xsd:element vc:typeUnavailable="xsd:precisionDecimal"
              name="datum" 
              type="xsd:double" />
XSD 1.0 processors can (and should) retrofit.
An online demo exists.

Lax fallback

What happens when a type is missing?
In 1.0:
  • either skip
  • or ‘fall back to lax assessment’ (with xsd:anyType)
In 1.1:
  • fall back to lax assessment with xsd:anyType (always)
May affect throughput.
Better interoperability.

Lax fallback for xsi:type

What happens when an instance-specified type is missing?
In 1.0: fail.
But we know it's either
  • a restriction of declared type
  • an extension of declared type
In 1.1: fall back to declared type.

Deploying XSD 1.1

Many changes are intended to empower users:
  • named subsets of PSVI
  • new names for conformance levels
  • error detection required
  • implementation-defined primitives, facets, built-ins
  • schema composition terminology
  • invocation terminology
  • choice of XML 1.0 or XML 1.1
  • implementation-defined, implementation-dependent

Implementation-defined, implementation-dependent

Following XQuery, XSLT 2.0, and SQL, XSD 1.1 distinguishes
  • implementation-defined: may vary from vendor to vendor. Must be documented.
  • implementation-dependent: may vary from moment to moment. Need not be documented (in fact, documentation is discouraged).

Checklist of implementation-defined features

Full list in the specs. Highlights:
  • Use of XML 1.0 or XML 1.1 name rules.
  • Read schema documents (or hard-coded schema?)
  • Web-aware?
  • Possible methods of invocation.
  • What parts of PSVI available to user? How?
  • Schema composition policy.
  • Additional primitives?
  • Implementation limits (max integer, etc.)

XML 1.0 or XML 1.1?

Use XML 1.0 or XML 1.1 definition of NCName?

Conforming implementations of this specification may provide either the 1.1-based datatypes or the 1.0-based datatypes, or both. If both are supported, the choice of which datatypes to use in a particular assessment episode should be under user control.

N.B. Some vendors would prefer not to give users the choice.

The PSVI

The PSVI is an abstraction, not an API or data format.
How much of it does your processor expose? How?
How much of it do you need? In what kind of form?

Named PSVI subsets

Some points of reference:
  • root-validity (valid? validation attempted? error?)
  • instance-validity (as above, for each element and attribute)
  • type-aware (as above, plus particle, declaration, type definition, etc.)
  • lightweight type-aware (as above, using names of types, not full type info)
  • full instance (everything but the components)
  • full PSVI with components (everything including the components)

Conformance levels

Old                                                  New
minimally conforming                                 minimally conforming
in conformance to the XML Representation of Schema   schema-document aware
fully conforming                                     Web-aware

Starting schema-validity assessment

  • type-driven: start here, using type definition foo
  • element-driven: start here, using element declaration bar
  • attribute-driven: start here, using attribute declaration baz
  • lax-wildcard validation: start here, as if matching a lax wildcard (i.e. either find the declaration or fall back to lax processing).
    Missing declaration? Not a problem.
  • strict-wildcard validation: start here, as if matching a strict wildcard (i.e. find the declaration, or the item is invalid).*
    Missing declaration? It's the end of the world.*

Schema composition terminology

Want to know a secret?
Processors can do what they like.
  • Where does the processor look?
  • Methods of indirection
  • What to use in indirect lookup
  • When to stop
  • Reacting to failure

Where to look

Where do schemas come from?
  • hard-coded schemas
  • automatically known components
  • hard-coded schema locations
  • named pairs (run-time options)
  • user-specified schema documents
  • interactive inquiry
  • namespace name
  • schemaLocation hints in the XML instance
  • schemaLocation hints in schema documents
  • local repository
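For instance, a schemaLocation hint in the XML instance pairs a namespace name with a document location. A sketch; the file name personName.xsd is hypothetical:
```xml
<personName xmlns="http://example.com/ns"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://example.com/ns personName.xsd">
 <given>Dave</given>
 <surname>Orchard</surname>
</personName>
```
It is only a hint: whether a processor follows it, and when, is one of the things the new terminology lets vendors state.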

Indirection

User control over where to look?
  • path indirection / search path
  • URI indirection
  • catalogs
  • local repository
  • recursion (multi-step lookup)
  • non-recursion
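As one illustration of catalog-based indirection, an OASIS XML Catalog can map either key to a local resource. A sketch with hypothetical paths; whether a given validator consults catalogs, and whether it looks up the namespace name or the location, is up to the implementation:
```xml
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
 <!-- key is a namespace name: map it to a local schema document -->
 <uri name="http://example.com/ns" uri="schemas/personName.xsd"/>
 <!-- key is a location: redirect a remote prefix to local files -->
 <rewriteURI uriStartString="http://example.com/schemas/"
             rewritePrefix="file:///usr/local/schemas/"/>
</catalog>
```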

What's the key?

Indirection looks things up. What does it look up?
  • namespace name
  • location
  • other ... ?

Stopping

When to stop?
  • stop after first success
  • consult all locations / resources
What if it's not there?
  • error
  • continue

Why all these terms?

The new terminology allows
  • vendors — to say what their software does
  • users — to say what they want their software to do
  • other specs — to say what schema validators must do, to work with their spec
Design assumption: you have power, you can negotiate.
Take charge!