Document modeling

Descriptive markup, SGML, XML

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

Rev. 4 September 2012

Overview

Organizational notes

Bureaucracy, paperwork, and so on ...

Discussion logs

The syllabus says your work will include:
  • (once or twice during the semester) preparation of a summary of the discussion in a class session
Sign up!
Any volunteer for today?

Vocabulary introductions

The syllabus says your work will include:
  • (once during the semester) a class presentation on an important XML vocabulary; you will be responsible for doing the necessary preparation, briefing the class on the origin and goals of the vocabulary, and providing a page with links to the defining documents for the vocabulary, the schema(s) for the vocabulary, and available documentation. If you are particularly energetic, you may also prepare to show the class how the vocabulary addresses the various concrete document-modeling and design questions we will ask about each vocabulary we look at.
Sign up!
Any volunteer for richtext / RFC 1341 (next week)?
Any requests for additional vocabularies?

Email

I do have an illinois.edu account.
But blackmesatech.com will reach me faster.

Questions from last week?

Anything we need to clear up before proceeding?

Documents in the real world

What have we got?

Generalizations

What have we got? Very common:
  • prose paragraphs
  • mostly English (some Greek, some Latin)
  • dialog, narration, expository prose
  • footnotes
  • pagination
  • sections

Generalizations

What have we got? Less common:
  • transaction record
  • special data types (numbers, dates, ...)
  • non-English text
  • linguistic annotation
  • born-digital material
  • graphics

What's missing?

A sample for group work

If (time permitting) we do any group tagging, we'll use one of these:
  • Signs on bridge (Oxford)
  • Descriptive markup

    Where SGML and XML came from ...
    (A just-so story.)

    Descriptive markup

    What is markup?

    Historically, markup is what the copy editor or designer wrote in the margin to guide the typesetter:

    SET ALL 9/10 × 14
    JUST
    ITC AVANT GARDE
    GOTHIC MED COND.

    Generalizing (metonymy!), markup is metatext that specifies the processing of the text.
    (Firmer definition to follow.)

    What is descriptive markup?

    Descriptive markup is markup that specifies not how to process the text, but what it is. Not

    SET ALL 9/10 × 14
    JUST
    ITC AVANT GARDE
    GOTHIC MED COND.

    but

    CHAPTER HEAD
    SECTION HEAD
    BODY TEXT

    WHY descriptive markup?

    Purely pragmatic reasons.
    • data reuse
    • greater consistency
    • easier revision, restyling
    • data longevity
    • device independence
    • application independence

    Two roots

    What became SGML (ISO 8879)
    • GML (IBM product)
    • GenCode effort

    Descriptive markup and ontology

    Data reuse pushes us toward declarative markup and toward ontology.
    Why?
    • When might two parts of the text sometimes be treated differently?
    • When will two parts of the text always be treated the same?
    In other words: our views of what things are change more slowly than our rules for processing them (Usdin).

    Descriptive markup and meta-language

    The GenCode committee started with the aim of defining a set of codes for use in descriptive markup.
    Then ... they changed goals: instead of defining a set of codes for everybody to use, they defined a way to define sets of codes.
    Classic instance of ‘escape to the meta-level’.

    SGML

    What's SGML? A library-type answer:

    International Organization for Standardization (ISO). 1986. ISO 8879-1986 (E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). International Organization for Standardization, Geneva, 1986.

    SGML and descriptive markup

    SGML was invented to support declarative markup.

    What is markup?

    Markup is a method
    of making explicit
    our interpretation
    of a text.

    Markup is a method
    of making explicit
    the salient properties
    of a text.

    The SGML tripod

    SGML provides:
    • a syntax for character streams
    • a validation mechanism
    • a data structure (implicit)
    For serious work with non-proprietary text, SGML has no competition.

    The SGML universe (1): content and markup

    An SGML data stream consists of:
    • content (aka parsed character data)
    • markup
      • tags
      • declarations
      • processing instructions
      • references (entity references, character references)
    So formally markup is (a) tags, declarations, PIs, and references, or equivalently (b) everything that's not content.
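To make the content/markup split concrete, here is a minimal sketch that asks a parser to sort the pieces of a tiny document into the two bins. It uses Python's standard xml.dom.minidom on an XML (not full SGML) sample; the sample string and the labels in KIND are invented for illustration. References never show up as nodes because the parser resolves them before building the tree.

```python
from xml.dom import minidom, Node

# Illustrative sample (an assumption, not taken from the slides).
SAMPLE = r"<p>Hello, <?TeX \vspace{2.5em}?>world!</p>"

# Labels are ours; they follow the content/markup split described above.
KIND = {
    Node.ELEMENT_NODE: "markup (tags)",
    Node.TEXT_NODE: "content",
    Node.PROCESSING_INSTRUCTION_NODE: "markup (processing instruction)",
}

def classify(xml_text):
    """Return (node name, kind) pairs for the root element and its children."""
    root = minidom.parseString(xml_text).documentElement
    result = [(root.nodeName, KIND[root.nodeType])]
    for child in root.childNodes:
        result.append((child.nodeName, KIND[child.nodeType]))
    return result

for name, kind in classify(SAMPLE):
    print(f"{name:8} {kind}")
```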

    The SGML universe (2): elements

    An SGML document is composed of elements, which can carry attributes.
    Elements nest (each is contained wholly in its parent).

    XML syntax

    The first leg of the tripod ...

    What's XML?

    XML is a subset of SGML.
    [Digression: what does “subset” mean here?]

    Where did XML come from?

    Institutionally: the World Wide Web Consortium. XML is defined by a W3C Recommendation.
    Intellectually and psychologically: the SGML user community.

    Delimiters

    Everything in XML is explicitly delimited.
    Elements delimited by tags.
    <greeting>Hello, world!</greeting>
    Tags delimited by angle brackets.
    <greeting>Hello, world!</greeting>
    Attribute values delimited by quotation marks.
    <greeting xml:lang="en">Hello, world!</greeting>
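One way to see that the delimiters are mandatory is to drop one and let a parser complain. A small sketch, assuming Python's standard xml.etree.ElementTree; the helper name is_well_formed and the two broken variants are ours.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = '<greeting xml:lang="en">Hello, world!</greeting>'
no_quotes = '<greeting xml:lang=en>Hello, world!</greeting>'   # attribute value not delimited
no_end_tag = '<greeting xml:lang="en">Hello, world!'           # element not delimited

for sample in (good, no_quotes, no_end_tag):
    print(is_well_formed(sample), sample)
```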

    Delimiters (2)

    References delimited by ampersand and semicolon.
    L'&eacute;tat, c'est moi!
    L'&#233;tat, c'est moi!
    L'&#x00E9;tat, c'est moi!
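A quick check that the three spellings denote the same text, sketched with Python's standard xml.etree.ElementTree. The wrapper element p is an assumption. Note that XML predefines only a handful of named entities, so the sketch declares eacute itself in an internal DTD subset; the two numeric character references need no declaration.

```python
import xml.etree.ElementTree as ET

# Numeric character references: decimal and hexadecimal.
decimal = ET.fromstring("<p>L'&#233;tat, c'est moi!</p>").text
hexadecimal = ET.fromstring("<p>L'&#x00E9;tat, c'est moi!</p>").text

# Named entity reference: declared here in an internal subset
# (assumption: the slides' example relies on a DTD that declares it).
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]>'
    "<p>L'&eacute;tat, c'est moi!</p>"
).text

print(decimal == hexadecimal == declared)
```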

    Elements

    Elements reflect the structure of the text:
    <div>
    <head>The SGML universe (2):  elements</head>
    <p>An SGML document is composed of 
    <term>elements</term>, which 
    can carry 
    <term>attributes</term>.
    </p>
    <p>Elements nest (each is contained wholly in its parent).</p>
    </div>
    Each element has a basic type.
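The nesting can be made visible by parsing the example and printing each element's type (its tag name) indented by depth. A minimal sketch using Python's standard xml.etree.ElementTree; the function name outline is ours.

```python
import xml.etree.ElementTree as ET

DIV = """<div>
<head>The SGML universe (2):  elements</head>
<p>An SGML document is composed of
<term>elements</term>, which
can carry
<term>attributes</term>.
</p>
<p>Elements nest (each is contained wholly in its parent).</p>
</div>"""

def outline(element, depth=0):
    """Return one line per element, indented two spaces per nesting level."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(DIV))
print("\n".join(tree_lines))
```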

    Attributes

    Attributes record additional properties of an element:
    <greeting lang="en">Hello!</greeting>
    <greeting lang="fr">Bonsoir!</greeting>
    <greeting lang="de">Gr&uuml;&szlig; Gott!</greeting>
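Once parsed, attributes are available as name/value pairs on each element. A sketch with Python's standard xml.etree.ElementTree, under two assumptions: the greetings are wrapped in a greetings element so the sample has a single root, and the named references in the German greeting are replaced by numeric character references because no DTD is supplied here.

```python
import xml.etree.ElementTree as ET

GREETINGS = """<greetings>
<greeting lang="en">Hello!</greeting>
<greeting lang="fr">Bonsoir!</greeting>
<greeting lang="de">Gr&#252;&#223; Gott!</greeting>
</greetings>"""

root = ET.fromstring(GREETINGS)

# Map each greeting's lang attribute to its text content.
by_lang = {g.get("lang"): g.text for g in root.findall("greeting")}

for lang, text in by_lang.items():
    print(lang, text)
```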

    Comments

    Comments contain material that ‘is not part of the text’.
    <p>We will be happy to deliver this
    by the end of the month, at a rate of $...
    <!--* Andy: Please check dates and amounts!!! *-->
    </p>
    <p>Please let us know at your earliest convenience.</p>
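A parser can be used to show what "not part of the text" means in practice. In this sketch, Python's standard xml.etree.ElementTree (whose default parser discards comments) extracts the text of the example; the letter wrapper element is an assumption added to give the two paragraphs a single root.

```python
import xml.etree.ElementTree as ET

LETTER = """<letter>
<p>We will be happy to deliver this
by the end of the month, at a rate of $...
<!--* Andy: Please check dates and amounts!!! *-->
</p>
<p>Please let us know at your earliest convenience.</p>
</letter>"""

root = ET.fromstring(LETTER)

# Concatenate all character data; the comment contributes nothing.
text = "".join(root.itertext())
print(text)
```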

    Processing instructions

    Processing instructions allow application-dependent data to be marked as such:
    <?TeX \vspace{2.5em}?>
    Q. But if SGML is about application independence, what are processing instructions doing here?
    A. Realism.
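A processing instruction reaches the application as a target plus an uninterpreted data string; what to do with it is up to whichever application recognizes the target. A minimal sketch with Python's standard xml.dom.minidom; the surrounding p element and its text are assumptions.

```python
from xml.dom import minidom

doc = minidom.parseString(r"<p>First paragraph.<?TeX \vspace{2.5em}?></p>")

# The processing instruction is the last child of <p>.
pi = doc.documentElement.lastChild
target, data = pi.target, pi.data

print("target:", target)
print("data:  ", data)
```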

    Declarations

    Declarations link the document to a document grammar (aka schema or document type definition).
    <!DOCTYPE greetings SYSTEM "greetings.dtd">
    <greetings>
    <greeting>Hello, world!</greeting>
    </greetings>
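The link recorded by the declaration can be read back from the parsed document. A sketch with Python's standard xml.dom.minidom, which is a non-validating parser: it reports the document type name and system identifier but does not fetch greetings.dtd or check the document against it.

```python
from xml.dom import minidom

DOC = """<!DOCTYPE greetings SYSTEM "greetings.dtd">
<greetings>
<greeting>Hello, world!</greeting>
</greetings>"""

doc = minidom.parseString(DOC)
doctype = doc.doctype

print("document type:", doctype.name)
print("system id:    ", doctype.systemId)
print("root element: ", doc.documentElement.tagName)
```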

    Questions

    Any questions so far?

    Document representation

    Some questions to think about ...

    Descriptive markup

    Sections and headings

    Does the vocabulary have markup for sections?
    For section headings?
    Do all sections have headings?

    Lists, paragraphs, and notes

    Does the vocabulary have markup for lists? paragraphs? footnotes? endnotes? block notes?
    Can lists occur between paragraphs?
    Can lists occur within paragraphs?
    Do paragraphs occur within list items?
    Ditto for notes.

    Phrase-level elements

    What sort of character- or phrase-level markup is allowed? (If something is italic, can I say why it's italic? Must I?)

    Direct discourse and quotation

    Does the vocabulary have markup for run-on quotations? block quotations?
    Can they be used for direct discourse in narrative?

    Poetry, drama

    Does the vocabulary have markup for verse? drama?

    Textual variation

    What happens if I must record different readings in different sources of the work?

    Annotation

    Does the vocabulary support arbitrary annotation of the document? Annotation of fixed types?

    Hypertext

    Does the vocabulary support hyperlinking? Outgoing? Incoming?

    Dates and other low-level datatypes

    Does the vocabulary have markup for dates? numbers? weights and measures? times of day? URIs?

    Metadata — inline? external?

    Does the vocabulary allow the XML document to identify itself using internal metadata? External metadata?

    A sample vocabulary

    The first of many ...

    Bare-bones TEI

    A very very small subset of the TEI vocabulary.

    Assignments

    No theory without practice.
    Due: Sunday morning 9 September 2012.