Document modeling
Descriptive markup, SGML, XML
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
Rev. 4 September 2012
Overview
- Organizational notes
- Documents in the real world
- Descriptive markup
- XML syntax
- Specimen problems in document representation
- Bare-bones TEI
Organizational notes
Bureaucracy, paperwork, and so on ...
Discussion logs
The syllabus says your work will include:
- (once or twice during the semester) preparation
of a summary of the discussion in a class session
Sign up!
Any volunteer for today?
Vocabulary introductions
The syllabus says your work will include:
- (once during the semester) a class presentation on an important XML
vocabulary; you will be responsible for doing the necessary
preparation, briefing the class on the origin and goals of the
vocabulary, and providing a page with links to the defining documents
for the vocabulary, the schema(s) for the vocabulary, and available
documentation. If you are particularly energetic, you may also prepare
to show the class how the vocabulary addresses the various concrete
document-modeling and design questions we will ask about each
vocabulary we look at.
Sign up!
Any volunteer for richtext / RFC 1341 (next week)?
Any requests for additional vocabularies?
Email
I do have an illinois.edu account.
But blackmesatech.com will reach me faster.
Questions from last week?
Anything we need to clear up before proceeding?
Documents in the real world
What have we got?
- Once around the table:
- What's the document?
- Is it typical or a corner case?
- What's typical?
- What's unusual? What pushes an edge? Which edge?
- Particular questions or problems?
- Some generalizations
Generalizations
What have we got? Very common:
- prose paragraphs
- mostly English (some Greek, some Latin)
- dialog, narration, expository prose
- footnotes
- pagination
- sections
Generalizations
What have we got? Less common:
- transaction record
- special data types (numbers, dates, ...)
- non-English text
- linguistic annotation
- born-digital material
- graphics
Descriptive markup
Where SGML and XML came from ...
(A just-so story.)
Descriptive markup
- What is descriptive markup?
- Origins
- Some implications
What is markup?
Historically, markup is what the copy editor or designer wrote in the
margin to guide the typesetter:
SET ALL 9/10 × 14
JUST
ITC AVANT GARDE
GOTHIC MED COND.
Generalizing (metonymy!), markup is
metatext that specifies the processing of
the text.
(Firmer definition to follow.)
What is descriptive markup?
Descriptive markup is markup that specifies not how to process
the text, but what it is.
Not
SET ALL 9/10 × 14
JUST
ITC AVANT GARDE
GOTHIC MED COND.
but
CHAPTER HEAD
SECTION HEAD
BODY TEXT
WHY descriptive markup?
Purely pragmatic reasons.
- data reuse
- greater consistency
- easier revision, restyling
- data longevity
- device independence
- application independence
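A minimal sketch of the restyling and reuse argument: one descriptively marked-up source, two independent renderings. The element names (`chapter`, `head`, `p`) and both style tables are assumptions made up for illustration, not part of any particular vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, echoing the CHAPTER HEAD / BODY TEXT labels.
SOURCE = """<chapter>
  <head>Descriptive markup</head>
  <p>Markup says what the text is.</p>
</chapter>"""

# Two independent "style sheets": element name -> formatting function.
PLAIN_STYLE = {
    "head": lambda text: text.upper(),
    "p": lambda text: "    " + text,
}
HTML_STYLE = {
    "head": lambda text: "<h1>" + text + "</h1>",
    "p": lambda text: "<p>" + text + "</p>",
}


def render(xml_text, style):
    """Apply one style to every child of the root element, in document order."""
    root = ET.fromstring(xml_text)
    return [style[child.tag](child.text) for child in root]


plain_lines = render(SOURCE, PLAIN_STYLE)
html_lines = render(SOURCE, HTML_STYLE)
print("\n".join(plain_lines))
print("\n".join(html_lines))
```

The source never changes; only the table that maps "what it is" to "how to process it" does.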
Two roots
What became SGML (ISO 8879)
- GML (IBM product)
- GenCode effort
Descriptive markup and ontology
Data reuse pushes us toward declarative
markup and toward ontology.
Why?
- When might two parts of the text sometimes
be treated differently?
- When will two parts of the text always
be treated the same?
In other words:
our views of what things are change more slowly
than our rules for processing them (-Usdin).
Descriptive markup and meta-language
The GenCode committee started with the aim of
defining a set of codes for use in descriptive markup.
Then ... they changed goals:
instead of defining a set of codes for everybody to use,
they defined
a way to define sets of codes.
Classic instance of ‘escape to the meta-level’.
SGML
What's SGML? A library-type answer:
International Organization for Standardization (ISO).
1986.
ISO 8879-1986
(E). Information processing — Text and Office Systems —
Standard Generalized Markup Language (SGML). International
Organization for Standardization, Geneva, 1986.
SGML and descriptive markup
SGML was invented to support declarative markup.
What is markup?
Markup is a method of making explicit our interpretation of a text.
Markup is a method of making explicit the salient properties of a text.
The SGML tripod
SGML provides:
- a syntax for character streams
- a validation mechanism
- a data structure (implicit)
For serious work with non-proprietary text,
SGML has no competition.
The SGML universe (1): content and markup
An SGML data stream consists of:
- content (aka parsed character data)
- markup
- tags
- declarations
- processing instructions
- references (entity references, character references)
So formally markup is
(a) tags, declarations, PIs, and references,
or equivalently (b) everything that's not
content.
The SGML universe (2): elements
An SGML document is composed of elements, which can carry attributes.
Elements nest (each is contained wholly in its parent).
XML syntax
The first leg of the tripod ...
What's XML?
XML is a subset of SGML.
[Digression: what does “subset” mean here?]
Where did XML come from?
Intellectually and psychologically:
the SGML user community.
Delimiters
Everything in XML is explicitly delimited.
Elements delimited by tags.
<greeting>Hello, world!</greeting>
Tags delimited by angle brackets.
<greeting>Hello, world!</greeting>
Attribute values delimited by quotation marks.
<greeting xml:lang="en">Hello, world!</greeting>
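Because every construct is explicitly delimited, a parser can recover each piece without guessing. A small sketch using Python's standard `xml.etree.ElementTree` (the choice of parser is an assumption; any conforming XML parser would do):

```python
import xml.etree.ElementTree as ET

greeting = ET.fromstring('<greeting xml:lang="en">Hello, world!</greeting>')

# The tags delimit the element and give its name.
element_name = greeting.tag
# Everything between the start-tag and the end-tag is content.
content = greeting.text
# ElementTree expands the reserved xml: prefix to its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
language = greeting.get(XML_LANG)

print(element_name, repr(content), language)
```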
Delimiters (2)
References delimited by ampersand and semicolon.
L'&eacute;tat, c'est moi!
L'&#233;tat, c'est moi!
L'&#xE9;tat, c'est moi!
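A reference can take three forms: a named entity reference, a decimal character reference, or a hexadecimal character reference. The sketch below assumes a wrapper element `quotes` and an internal declaration for `eacute`, which XML, unlike HTML, does not predefine. It checks that all three forms yield the same character once parsed.

```python
import xml.etree.ElementTree as ET

# The <quotes> and <q> element names are hypothetical; the entity must be
# declared because XML predefines only amp, lt, gt, apos and quot.
DOC = """<!DOCTYPE quotes [
  <!ENTITY eacute "&#233;">
]>
<quotes>
  <q>L'&eacute;tat, c'est moi!</q>
  <q>L'&#233;tat, c'est moi!</q>
  <q>L'&#xE9;tat, c'est moi!</q>
</quotes>"""

# After parsing, the references are gone; only the character remains.
readings = [q.text for q in ET.fromstring(DOC)]
all_equal = len(set(readings)) == 1
print(readings, all_equal)
```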
Elements
Elements reflect the structure of the text:
<div>
<head>The SGML universe (2): elements</head>
<p>An SGML document is composed of
<term>elements</term>, which
can carry
<term>attributes</term>.
</p>
<p>Elements nest (each is contained wholly in its parent).</p>
</div>
Each element has a basic type.
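The nesting can be made visible by walking the parsed tree. This is only a sketch; the `outline` helper is a made-up name, and the input is the `div` example from this slide.

```python
import xml.etree.ElementTree as ET

DIV = """<div>
<head>The SGML universe (2): elements</head>
<p>An SGML document is composed of
<term>elements</term>, which
can carry
<term>attributes</term>.
</p>
<p>Elements nest (each is contained wholly in its parent).</p>
</div>"""


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(DIV))
print("\n".join(tree_lines))
```

Each element appears exactly once, under exactly one parent: the implicit data structure is a tree.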
Attributes
Attributes record additional properties of an element:
<greeting lang="en">Hello!</greeting>
<greeting lang="fr">Bonsoir!</greeting>
<greeting lang="de">Grüß Gott!</greeting>
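A sketch of how software might use such attributes, here to select content by language. The wrapper element `greetings` is an assumption, added because an XML document needs a single root.

```python
import xml.etree.ElementTree as ET

GREETINGS = """<greetings>
  <greeting lang="en">Hello!</greeting>
  <greeting lang="fr">Bonsoir!</greeting>
  <greeting lang="de">Grüß Gott!</greeting>
</greetings>"""

root = ET.fromstring(GREETINGS)

# Index the content of each greeting by the value of its lang attribute.
by_language = {g.get("lang"): g.text for g in root.findall("greeting")}
german = by_language["de"]
print(sorted(by_language), german)
```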
Comments
Comments contain material that
‘is not part of the text’.
<p>We will be happy to deliver this
by the end of the month, at a rate of $...
<!--* Andy: Please check dates and amounts!!! *-->
</p>
<p>Please let us know at your earliest convenience.</p>
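One way to see that comments are "not part of the text": with its default settings, Python's `xml.etree.ElementTree` drops them while building the tree. The `letter` wrapper element is an assumption added to give the two paragraphs a single root.

```python
import xml.etree.ElementTree as ET

LETTER = """<letter>
<p>We will be happy to deliver this
by the end of the month, at a rate of $...
<!--* Andy: Please check dates and amounts!!! *-->
</p>
<p>Please let us know at your earliest convenience.</p>
</letter>"""

root = ET.fromstring(LETTER)

# itertext() yields the character content only; the comment is not in the tree.
visible_text = "".join(root.itertext())
comment_survived = "Andy" in visible_text
print(comment_survived)
```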
Processing instructions
Processing instructions allow
application-dependent data to be marked
as such:
<?TeX \vspace{2.5em}?>
Q. But if SGML is about application
independence, what are processing instructions
doing here?
A. Realism.
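A sketch of how an application might pick out the instructions addressed to it and ignore the rest. It uses `xml.dom.minidom` because that parser keeps processing instructions in the tree by default; the surrounding paragraph is made up for illustration.

```python
from xml.dom import minidom

# Raw string so the backslash in the TeX command reaches the parser intact.
DOC = r"<p>First paragraph.<?TeX \vspace{2.5em}?></p>"

paragraph = minidom.parseString(DOC).documentElement

# Each processing instruction has a target (who it is for) and data (what to do).
instructions = [
    (node.target, node.data)
    for node in paragraph.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

# A TeX-based formatter would act on these; any other application skips them.
for_tex = [data for target, data in instructions if target == "TeX"]
print(for_tex)
```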
Declarations
Declarations link the document to a document
grammar (aka schema or document type definition).
<!DOCTYPE greetings SYSTEM "greetings.dtd">
<greetings>
<greeting>Hello, world!</greeting>
</greetings>
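A sketch of what the declaration gives a processor: the name of the expected root element and a pointer to the grammar. Python's standard library does not validate against a DTD, so this example only reads the declaration; checking the document against `greetings.dtd` would need a validating parser.

```python
from xml.dom import minidom

DOC = """<!DOCTYPE greetings SYSTEM "greetings.dtd">
<greetings>
<greeting>Hello, world!</greeting>
</greetings>"""

# The DTD file itself is never opened here; only the declaration is parsed.
document = minidom.parseString(DOC)
doctype = document.doctype

# The declared document type name should match the actual root element.
root_matches = doctype.name == document.documentElement.tagName
print(doctype.name, doctype.systemId, root_matches)
```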
Questions
Any questions so far?
Document representation
Some questions to think about ...
Descriptive markup
- sections and headings
- lists, paragraphs, and notes
- phrase-level elements
- direct discourse and quotation
- poetry, drama
- textual variation
- annotation
- dates and other low-level datatypes
- metadata — inline? external?
Sections and headings
Does the vocabulary have markup for
sections?
For section headings?
Do all sections have headings?
Lists, paragraphs, and notes
Does the vocabulary have markup for
lists? paragraphs? footnotes? endnotes?
block notes?
Can lists occur between paragraphs?
Can lists occur within paragraphs?
Do paragraphs occur within list items?
Ditto for notes.
Phrase-level elements
What sort of character- or phrase-level markup
is allowed? (If something is italic, can I say
why it's italic? Must I?)
Direct discourse and quotation
Does the vocabulary have markup for
run-on quotations? block quotations?
Can they be used for direct discourse in narrative?
Poetry, drama
Does the vocabulary have markup for
verse? drama?
Textual variation
What happens if I must record different
readings in different sources of the work?
Annotation
Does the vocabulary support
arbitrary annotation of the document?
Annotation of fixed types?
Hypertext
Does the vocabulary support
hyperlinking?
Outgoing?
Incoming?
Dates and other low-level datatypes
Does the vocabulary have markup for
dates? numbers? weights and measures? times of day?
URIs?
Metadata — inline? external?
Does the vocabulary allow the XML document to
identify itself using internal metadata?
External metadata?
A sample vocabulary
The first of many ...
Bare-bones TEI
A very very small subset of the TEI vocabulary.
Assignments
No theory without practice.
- Make an XML version of your document, making up the tags
as you go along. You must decide:
- What needs to be marked up.
- How to mark it up.
- Make an XML version of your document using Bare Bones TEI.
You must decide:
- What needs to be marked up.
- How to mark it up using the Bare Bones TEI vocabulary
(if you can), or how to compensate (otherwise).
Due: Sunday morning 9 September 2012.
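For either assignment, a quick well-formedness check can catch unclosed or overlapping tags before you hand anything in. This is a sketch only; the helper name `check_well_formed` is made up, and it tests well-formedness, not conformance to Bare Bones TEI.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None


good = check_well_formed("<note><p>Nested properly.</p></note>")
bad = check_well_formed("<note><p>Overlapping.</note></p>")
print(good)
print(bad)
```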