XML Schema (XSD) 1.1
What's new?
C. M. Sperberg-McQueen, Black Mesa Technologies
Summer XML 2009, 27 July 2009
Overview and introduction
Overview
- Introduction
- Background
- Classes of change in XSD 1.1
- Some new features in XSD 1.1
- Datatypes changes
- Co-occurrence constraints
- Other declaration changes
- Versioning XSD itself
- Deploying XSD 1.1
Background
- XSD 1.0 Recommendation: May 2001
- XSD 1.0 Second Edition: October 2004
- What we expected:
- minor tweaks
- some additional functionality
- What happened ...
Classes of change in XSD 1.1
- Bug fixes
- Editorial improvements
- Conceptual clarifications, simplifications
- Alignment with other specs
- Improved functionality, convenience
- including support for easier versioning,
extensibility, backward compatibility
Editorial changes
- new name!
- new terminology
- reorganized XML mappings
- if vs. if and only if
- must ... be vs. must ... must
- sentence-by-sentence revisions
- ... and hundreds more.
Conceptual clarifications
- restriction rules simplified
- post-schema-validation info set (PSVI)
- validation rules and terminology
- conformance levels
- schema composition terminology
- ... and more
Nine specific changes to look for
- New datatypes (and other Datatypes changes)
- Changes to the ID datatype
- Conditional type assignment
- Assertions
- Weakened wildcards
- Open content
- Changes to substitution groups
- XSD versioning
- Lax fallback
Datatypes changes
- Clarifications
- (Arithmetic) equality vs. identity
- Tighter value space / lexical space story
- New datatypes:
- anyAtomicType
- yearMonthDuration
- dayTimeDuration
- dateTimeStamp
- precisionDecimal
- Implementation-defined primitives
Numbers and significant digits
Is 5 = 5.00?
In normal engineering practice, no:
one significant digit vs. three.
In xsd:decimal, yes: equal and identical.
In xsd:precisionDecimal, 5 and 5.00 are
equal, but not identical.
Precision decimal
New xsd:precisionDecimal datatype: like xsd:decimal, but tracks significant digits.
To require measurement to the nearest hundredth:
<xsd:simpleType name="measurement">
  <xsd:restriction base="xsd:precisionDecimal">
    <xsd:minScale value="2"/>
    <xsd:maxScale value="2"/>
  </xsd:restriction>
</xsd:simpleType>
The explicitTimezone facet
Time zone can be required:
<xsd:simpleType name="dateTimeStamp">
  <xsd:restriction base="xsd:dateTime">
    <xsd:explicitTimezone fixed="true"
                          value="required"/>
  </xsd:restriction>
</xsd:simpleType>
N.B. you don't need to define this one: it's built in.
The explicitTimezone facet (2)
Time zone can be forbidden:
<xsd:simpleType name='bare-date'>
  <xsd:restriction base='xsd:date'>
    <xsd:explicitTimezone value='prohibited'/>
  </xsd:restriction>
</xsd:simpleType>
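A sketch of how instance values would fare against the bare-date type above. The element name published is hypothetical; it is assumed to be declared with type bare-date:

```xml
<!-- valid: no time zone -->
<published>2009-07-27</published>

<!-- invalid: time zone present (Z) -->
<published>2009-07-27Z</published>

<!-- invalid: time zone present (numeric offset) -->
<published>2009-07-27-04:00</published>
```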
Changes to the ID datatype
Now legal:
list of xsd:ID
unions of xsd:ID and other types
default and fixed xsd:ID values*
multiple xsd:ID attributes on
same element
(N.B. multiple ID elements were already legal.)
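A minimal sketch of the first item in the list above, a list of xsd:ID. The names idList, term and aliases are hypothetical, and the schema is assumed to have no target namespace:

```xml
<!-- a list type whose items are IDs -->
<xsd:simpleType name="idList">
  <xsd:list itemType="xsd:ID"/>
</xsd:simpleType>

<!-- two ID-bearing attributes on the same element -->
<xsd:complexType name="term">
  <xsd:attribute name="id" type="xsd:ID"/>
  <xsd:attribute name="aliases" type="idList"/>
</xsd:complexType>
```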
Supporting xml:id with XSD 1.1
So you can now write:
<xsd:complexType name="...">
  <!--* ... *-->
  <xsd:attribute name="id" type="xsd:ID"/>
  <xsd:attribute ref="xml:id"/>
</xsd:complexType>
Q: How do I ... ?
Common questions:
- “How do I say that if xml:lang="ja",
the element has type my:Japanese-prose,
and otherwise my:Western-prose?”
- “How do I say that if attribute a
is "v1", then b must
not be any of v2,
v3, v4?”
- “How do I require that the count
attribute give the correct number of children?”
- etc.
A: Co-occurrence constraints
Two forms:
- conditional type assignment (CTA)
- assertions (check clauses)
Both involve evaluation of XPath expressions.
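As an illustration, the second question above (a and b) could be answered with an assertion along these lines. This is only a sketch: the type name settings and the use of xsd:string for both attributes are assumptions.

```xml
<xsd:complexType name="settings">
  <xsd:attribute name="a" type="xsd:string"/>
  <xsd:attribute name="b" type="xsd:string"/>
  <!-- if a is "v1", then b must not be any of v2, v3, v4 -->
  <xsd:assert test="not(@a = 'v1' and @b = ('v2', 'v3', 'v4'))"/>
</xsd:complexType>
```

If b is absent, the comparison against the sequence is false, so the assertion holds.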
Conditional type assignment
Based on work by Fabio Vitali et al.
- Normal type assignment: one declared type for each element.
- Conditional type assignment: sequence of test + type pairs.
- Element E has ...
- If 〈test 1〉 then type T1
- else if 〈test 2〉 then type T2
- ...
- else if 〈test n〉 then type Tn
- else type T
Supporting Atom message types with XSD 1.1
<xsd:element name="message" type="messageType">
  <xsd:alternative test="@kind='string'"
                   type="messageTypeString"/>
  <xsd:alternative test="@kind='base64'"
                   type="messageTypeBase64"/>
  <xsd:alternative test="@kind='binary'"
                   type="messageTypeBase64"/>
  <xsd:alternative test="@kind='xml'"
                   type="messageTypeXML"/>
  <xsd:alternative test="@kind='XML'"
                   type="messageTypeXML"/>
</xsd:element>
Internationalizing your prose
Conditional type assignment works with xml:lang:
<xsd:element name="para" type="my:proseType">
  <xsd:alternative test="@xml:lang='ja'"
                   type="my:prose_japanese"/>
  <xsd:alternative test="@xml:lang='ar'"
                   type="my:prose_arabic"/>
  <xsd:alternative test="@xml:lang='he'"
                   type="my:prose_hebrew"/>
</xsd:element>
Restrictions on CTA tests
Tests can refer to
- constants
- attributes on the element itself
but not to
- ancestors, siblings, children, descendants
- attributes on the above
So context-independence and pre-order type assignment are preserved.
Inherited attributes
But wait! xml:lang isn't always
specified: mostly it's inherited!
If I can't refer to my ancestors, how does this work?!
Supporting xml:lang with XSD 1.1
<xsd:attribute name="lang" inheritable="true">
  <!--* ... *-->
  <xsd:simpleType>
    <xsd:union memberTypes="xsd:language">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value=""/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:union>
  </xsd:simpleType>
</xsd:attribute>
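A sketch of the effect in an instance, assuming xml:lang is declared inheritable as above and that the (hypothetical) doc element's type uses that declaration. The test @xml:lang='ja' on the first para sees the inherited value, even though the test never looks at an ancestor:

```xml
<doc xml:lang="ja">
  <!-- no xml:lang here; the value "ja" is inherited from doc -->
  <para>...</para>
  <!-- a value given on the element itself takes precedence -->
  <para xml:lang="ar">...</para>
</doc>
```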
Assertions
Like SQL check clauses, a way to specify additional constraints
in the form of query predicates.
Cf. also Schematron.
<xsd:complexType name="intRange">
  <xsd:attribute name="min" type="xsd:int"/>
  <xsd:attribute name="max" type="xsd:int"/>
  <xsd:assert test="@min le @max"/>
</xsd:complexType>
Checking the count attribute
<xsd:complexType name="arrayType">
  <xsd:sequence>
    <xsd:element name="entry"
                 minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
  <xsd:attribute name="length" type="xsd:int"/>
  <xsd:assert test="@length eq fn:count(./entry)"/>
</xsd:complexType>
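A sketch of two instances checked against arrayType. The element name array is hypothetical; it is assumed to be declared with type arrayType:

```xml
<!-- valid: length agrees with the number of entry children -->
<array length="2">
  <entry/>
  <entry/>
</array>

<!-- invalid: the assertion is false (3 is not equal to 2) -->
<array length="3">
  <entry/>
  <entry/>
</array>
```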
XPath evaluation
- XPath expressions as predicates:
- True → OK
- False → not valid
- XPath error → not valid
- Evaluated on a subtree.
- Full XPath is legal;
for CTA, implementations may support subset.
- Attributes, descendants are typed;
element itself is not typed.
CTA vs. Assertions (vs. Schematron)
- CTA:
- element - type binding
- very restricted XDM (attributes only)
- Assertions:
- part of type
- XDM has (typed) subtree
- Schematron:
- typically element-based
- may point anywhere in the document
- not type-aware*
- streamable?
Revision to UPA
The ‘unique particle attribution’ rule
(aka ‘determinism’ rule)
has changed.
Informally: (when you match input against
model, you must) know without lookahead which
token matches the input.
(a?, b?, c?) // deterministic
(a, a?) // deterministic
(a?, a) // non-deterministic
(a+, (b | c)*, d+)* // (strongly) deterministic
(a+, (b | c)*, d*)* // (weakly) deterministic
(my:a | #ANY)* // I'm glad you asked ...
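For readers who think in schema documents rather than regular expressions, the second and third models above might be written roughly as follows (a sketch; the local element a is left untyped):

```xml
<!-- (a, a?): deterministic; the optional particle comes last -->
<xsd:sequence>
  <xsd:element name="a"/>
  <xsd:element name="a" minOccurs="0"/>
</xsd:sequence>

<!-- (a?, a): non-deterministic; on seeing an a, a processor cannot tell
     without lookahead which of the two particles it matches -->
<xsd:sequence>
  <xsd:element name="a" minOccurs="0"/>
  <xsd:element name="a"/>
</xsd:sequence>
```

Both models accept the same inputs; only the first satisfies the rule.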
Competition
paths: For any model M, every sentence in L(M) traces a path in M.
E.g. the path of “a a b d a d” in (a+, (b | c)*, d+)*.
competition: Two particles P1 and P2 compete when some input has two paths Q1, Q2 in M that differ only in the last item:
- Q1 = Q0 + < P1 >
- Q2 = Q0 + < P2 >
ambiguity: If some sequence S in L(M) has two paths in M, then M is ambiguous.
determinism: Stronger than non-ambiguity: no S has two paths in M, even if S ∉ L(M).
New UPA rule
(1)
M satisfies new UPA iff:
- No two element particles in M compete.
- No two wildcard particles in M compete.
(2) A validation path uses element particles,
not wildcards, when elements and wildcards compete.
Affects particle matching, type assignment.
S locally valid iff S has validation path
(not just any path) in M.
N.B. V(M) ⊆ L(M).
N.B. for some M, V(M) ≠ L(M): e.g.
(#any?, a)
or ((#any, x) | (a, b)).
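A sketch of the first example, (#any?, a), as a schema fragment, with one input that is in L(M) but not in V(M). The type name m is hypothetical:

```xml
<xsd:complexType name="m">
  <xsd:sequence>
    <xsd:any minOccurs="0"/>
    <xsd:element name="a"/>
  </xsd:sequence>
</xsd:complexType>
<!--
  Input "a a":
    in L(M):     the first a matches the wildcard, the second matches element a
    not in V(M): the wildcard and element a compete for the first a, so the
                 validation path must use element a; nothing is then left
                 to match the second a
-->
```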
Weak wildcards in XSD 1.1 (basic structure)
Simple-minded design for personName:
<personName xmlns="http://example.com/ns">
  <given>Dave</given>
  <surname>Orchard</surname>
</personName>
The declaration:
<xsd:complexType name="personname">
  <xsd:sequence>
    <xsd:element ref="tns:given"/>
    <xsd:element ref="tns:surname"/>
  </xsd:sequence>
</xsd:complexType>
Weak wildcards 2 (planning extensibility)
But suppose
- We expect we may revise this schema.
- We want 1.0 handlers to accept 2.0 messages.
- We expect that 1.0 handlers
- accept valid 1.0 messages
- reject all others
N.B. this does not go without saying.
No law says you cannot process invalid data.
Weak wildcards 3 (worrying about 2.0)
Weak wildcards 4 (1.0 wildcards)
We'd like to write
<xsd:complexType name="personname">
  <xsd:sequence>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:given"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:surname"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
But this violates UPA. (Why?)
Weak wildcards 5 (1.1 wildcards)
In XSD 1.1, this example is legal:
<xsd:complexType name="personname">
  <xsd:sequence>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:given"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:surname"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
Weak wildcards 6 (a 1.1 gotcha)
But in XSD 1.1, this example is still illegal:
<xsd:complexType name="personname">
  <xsd:sequence>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:given"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:middle" minOccurs="0"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:surname"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
(Why?)
Weak wildcards 7 (a legal 1.1 formulation)
A standard technique for solving this kind of UPA violation:
<xsd:complexType name="personname">
  <xsd:sequence>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:given"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:sequence minOccurs="0">
      <xsd:element ref="tns:middle"/>
      <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:element ref="tns:surname"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
Open content: the problem
All we wanted to say was:
We want
- a given element
- followed optionally by a middle
- followed by a surname
- with anything at all allowed before, between, and after.
That is, what some schema languages call ‘open content’.
Why is that so hard?
Open content: the feature
OK. Old version:
<xsd:complexType name="personname">
  <xsd:sequence>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="tns:given"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    <xsd:sequence minOccurs="0">
      <xsd:element ref="tns:middle"/>
      <xsd:any minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:element ref="tns:surname"/>
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
Open content: the feature (2)
OK. New version:
<xsd:complexType name="personname">
  <xsd:openContent>
    <xsd:any/>
  </xsd:openContent>
  <xsd:sequence>
    <xsd:element ref="tns:given"/>
    <xsd:element ref="tns:middle" minOccurs="0"/>
    <xsd:element ref="tns:surname"/>
  </xsd:sequence>
</xsd:complexType>
An xsd:defaultOpenContent element can provide
a default open-content wildcard for all types in a schema document.
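A minimal sketch of such a default, assuming the example namespace used earlier. The mode and processContents values shown are illustrative choices, not requirements:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:tns="http://example.com/ns"
            targetNamespace="http://example.com/ns">

  <!-- applies to the complex types defined in this schema document -->
  <xsd:defaultOpenContent mode="interleave">
    <xsd:any processContents="lax"/>
  </xsd:defaultOpenContent>

  <xsd:complexType name="personname">
    <xsd:sequence>
      <xsd:element ref="tns:given"/>
      <xsd:element ref="tns:middle" minOccurs="0"/>
      <xsd:element ref="tns:surname"/>
    </xsd:sequence>
  </xsd:complexType>

  <!--* element declarations for given, middle, surname omitted *-->
</xsd:schema>
```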
Other wildcard changes
Other enhancements to wildcards:
- negative wildcards (exclude certain namespaces, certain QNames)
- not-my-sibling wildcards
- not-in-schema wildcards (BEWARE!)
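A sketch of how these wildcards look in a schema document. The tns prefix and the excluded element name are assumptions for illustration:

```xml
<!-- negative wildcard: exclude namespaces -->
<xsd:any notNamespace="##targetNamespace ##local"/>

<!-- negative wildcard: exclude one QName -->
<xsd:any notQName="tns:surname"/>

<!-- not-my-sibling: exclude elements declared in the same content model -->
<xsd:any notQName="##definedSibling"/>

<!-- not-in-schema: exclude any element with a top-level declaration -->
<xsd:any notQName="##defined"/>
```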
Changes to substitution groups
- Multiple substitution group heads
- Abstract elements in substitution groups
Makes common document architectures much easier.
But N.B. UPA can still be a problem.
Using substitution groups for extensibility
It's easy to make a vocabulary extensible:
<xsd:element name="regex"
             type="my:xsd_regex"
             substitutionGroup="tns:phrase tns:code"/>
<xsd:element name="formula"
             type="my:fopc"
             substitutionGroup="tns:display tns:chunk"/>
XSD versioning
- xsd:schema now has open content
- block="#all" is gone
- identifiers for versions of XSD (coexistence)
- implementation-defined primitives
- implementations can allow user-defined primitives
- version-control (vc:*) attributes
Version-control attributes
Every element in a schema document can have version-control attributes:
- vc:minVersion, vc:maxVersion:
What version of XSD does the validator support?
- vc:typeAvailable, vc:typeUnavailable:
Are these types built-in?
- vc:facetAvailable, vc:facetUnavailable:
Does the processor understand these facets?
Expected usage: test for implementation-defined extensions.
Using version-control in schema documents
We want to use precision decimal if we can,
or double otherwise:
<xsd:element vc:typeAvailable="xsd:precisionDecimal"
             name="datum"
             type="xsd:precisionDecimal" />
<xsd:element vc:typeUnavailable="xsd:precisionDecimal"
             name="datum"
             type="xsd:double" />
XSD 1.0 processors can (and should) retrofit.
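The version attributes work the same way. A sketch, reusing the intRange type from the assertions example: vc:minVersion is an inclusive lower bound and vc:maxVersion an exclusive upper bound, so a given processor keeps exactly one of the two definitions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning">

  <!-- kept by processors that support version 1.1 or later -->
  <xsd:complexType name="intRange" vc:minVersion="1.1">
    <xsd:attribute name="min" type="xsd:int"/>
    <xsd:attribute name="max" type="xsd:int"/>
    <xsd:assert test="@min le @max"/>
  </xsd:complexType>

  <!-- kept by processors that support only versions below 1.1 -->
  <xsd:complexType name="intRange" vc:maxVersion="1.1">
    <xsd:attribute name="min" type="xsd:int"/>
    <xsd:attribute name="max" type="xsd:int"/>
  </xsd:complexType>

</xsd:schema>
```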
Lax fallback
What happens when a type is missing?
In 1.0:
- either skip
- or ‘fall back to lax assessment’ (with
xsd:anyType)
In 1.1:
- fall back to lax assessment with
xsd:anyType (always)
May affect throughput.
Better interoperability.
Lax fallback for xsi:type
What happens when an instance-specified type is missing?
In 1.0: fail.
But we know it's either
- a restriction of declared type
- an extension of declared type
In 1.1: fall back to declared type.
Deploying XSD 1.1
Many changes are intended to empower users:
- named subsets of PSVI
- new names for conformance levels
- error detection required
- implementation-defined primitives, facets, built-ins
- schema composition terminology
- invocation terminology
- choice of XML 1.0 or XML 1.1
- implementation-defined, implementation-dependent
Implementation-defined, implementation-dependent
Following XQuery, XSLT 2.0, and SQL, XSD 1.1 distinguishes
- implementation-defined: may vary from
vendor to vendor. Must be documented.
- implementation-dependent: may vary from
moment to moment. Need not be documented
(and in fact, documentation discouraged).
Checklist of implementation-defined features
Full list in the specs. Highlights:
- Use of XML 1.0 or XML 1.1 name rules.
- Read schema documents (or hard-coded schema?)
- Web-aware?
- Possible methods of invocation.
- What parts of PSVI available to user? How?
- Schema composition policy.
- Additional primitives?
- Implementation limits (max integer, etc.)
XML 1.0 or XML 1.1?
Use XML 1.0 or XML 1.1 definition of NCName?
Conforming implementations of this specification may provide
either the 1.1-based datatypes or the 1.0-based datatypes, or both. If
both are supported, the choice of which datatypes to use in a
particular assessment episode should be under user control.
N.B. Some vendors would prefer not to give
users the choice.
The PSVI
The PSVI is an abstraction, not an API or data format.
How much of it does your processor expose? How?
How much of it do you need? In what kind of form?
Named PSVI subsets
Some points of reference:
- root-validity (valid? validation attempted? error?)
- instance-validity (as above, for
each element and attribute)
- type-aware (as above, plus particle, declaration, type
definition, etc.)
- lightweight type-aware (as above, using names of types,
not full type info)
- full instance (everything but the components)
- full PSVI with components (everything including
the components)
Conformance levels
Old | New |
minimally conforming | minimally conforming |
in conformance to the XML Representation of Schema | schema-document aware |
fully conforming | Web-aware |
Starting schema-validity assessment
type-driven: start here, using type definition foo
element-driven: start here, using element declaration bar
attribute-driven: start here, using attribute declaration baz
lax-wildcard validation: start here, as if matching a lax
wildcard (i.e. either find the declaration or fall back to lax processing).
Missing declaration? Not a problem.
strict-wildcard validation: start here, as if matching a strict
wildcard (i.e. either find the declaration or fall back to lax processing).*
Missing declaration? It's the end of the world.*
Schema composition terminology
Want to know a secret?
Processors can do what they like.
- Where does the processor look?
- Methods of indirection
- What to use in indirect lookup
- When to stop
- Reacting to failure
Where to look
Where do schemas come from?
- hard-coded schemas
- automatically known components
- hard-coded schema locations
- named pairs (run-time options)
- user-specified schema documents
- interactive inquiry
- namespace name
- schemaLocation hints in the XML instance
- schemaLocation hints in schema documents
- local repository
Indirection
User control over where to look?
- path indirection / search path
- URI indirection
- catalogs
- local repository
- recursion (multi-step lookup)
- non-recursion
What's the key?
Indirection looks things up. What
does it look up?
- namespace name
- location
- other ... ?
Stopping
When to stop?
- stop after first success
- consult all locations / resources
Why all these terms?
The new terminology allows
- vendors — to say what their software does
- users — to say what they want their software to do
- other specs — to say what schema validators
must do, to work with their spec
Design assumption: you have power,
you can negotiate.
Take charge!