Document modeling
A backward glance
and
More on DTDs
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
Rev. 18 September 2012
Overview
- Organizational notes
- Vocabulary tour
- Historical survey: document modeling before electronic computers
- More on DTDs
- Assignments for 30 September
Organizational notes
Bureaucracy, paperwork, and so on ...
Signups
Reminder about
- minute-taking
- vocabulary survey
Questions from last week?
Anything we need to clear up before proceeding?
Tour of vocabularies, cont'd
Historical survey - Introduction
Where are we?
You are here
Overview
Representation of individual documents
- introduction to SGML and XML
- syntax (angle brackets)
- model (trees)
- DTDs
- → historical survey
But also here
Overview
Representation of individual documents
- introduction to SGML and XML
- syntax (angle brackets)
- model (trees)
- (→) DTDs
- → historical survey
What is document modeling?
Remember how we defined
document:
- (for physical documents) an artifact carrying
writing; typically an artifact whose main purpose
is to carry writing
- (for electronic documents) an object
managed by a computer system which resembles
a non-electronic document in function
Some obvious questions
- What do we mean by writing?
- Does the nature of writing impose
constraints on documents? on document modeling?
- Do different kinds of writing lead to
different kinds of document?
The basic technology: writing
In which we worry about foundations.
What is writing?
Coulmas* identifies three “fundamental characteristics”
of writing:
- artificial graphical marks on durable surface
- intended to communicate something
- has conventional relation to language
* Florian Coulmas, The writing systems
of the world (Oxford: Blackwell, 1989), p. 17.
What is writing?
Sampson* is more general:
To ‘write’ might be defined,
at a first approximation, as:
to communicate relatively specific ideas
by means of permanent, visible marks.
* Geoffrey Sampson, Writing systems:
A linguistic introduction (Stanford: Stanford
University Press, 1985), p. 26.
Contrast:
- sign language
- works of art
A recursive loop?
Q. If writing is an attempt to
represent some
(ideational or linguistic)
content, then ...
... doesn't that mean that writing is
itself an attempt to model ideas and/or language?
So is document modeling just a second-level attempt to
model something that is itself already a model?
A. Yes and yes, it certainly looks that way.
Benjamin's Kunstwerk
Cf. Walter Benjamin, Das Kunstwerk
im Zeitalter seiner technischen Reproduzierbarkeit
(1936).
A selective, tendentious summary: reproductions
- may degrade / represent imperfectly;
- may approach ubiquity;
- not fixed in time or space;
- lack the aura
of the original.
Language and technical reproducibility
Q. What technology enables the technological
reproducibility of language?
A. Writing.
Q. How does writing model language?
Kinds of writing
Sampson offers this typology*:
Writing:
- semasiographic (?)
- glottographic
  - logographic
    - based on polymorphemic unit (e.g. word) (?)
    - morphemic
  - phonographic
    - syllabic
    - segmental
    - based on phonological features
* “(?)” marks categories Sampson sees as open to question.
Semasiography
Writing that represents
ideational content directly.
- Amerindian ‘picture writing’
and other remote examples (?)
- language-independent* instruction manuals
- international signage icons
- mathematics
- diagram languages (UML, chemical bond diagrams, ...)
Glottography
Writing that represents
ideational content indirectly, by
representing language.*
- one word, one symbol? (unattested*)
- one morpheme, one symbol (Han characters*)
- one syllable (form), one symbol (katakana, hiragana, Linear B,
Cherokee, ...)
- one phonetic segment, one symbol (most alphabets)
- one phonological feature, one symbol (Han-gul)
Note: most actual writing systems are mixtures, not pure.
* So document models model documents,
which model language,
which models ideas?
Some oppositions
- iconic (motivated) vs arbitrary
- complete vs incomplete (defective)
- deep vs shallow
- linear vs non-linear*
All writing systems are incomplete
N.B.
no writing system captures
all aspects of language:
- lexical identity (ambiguity)
- suprasegmentals (intonation, volume, timbre)
- vowels (?)
At one extreme, we may be unable to distinguish
semasiography from very incomplete glottography.
The writing substrate
- tablet
- stone
- sheet
- scroll
- codex
A simple document model
What document model does a Linear B inscription have?
- A document
- is a (relatively short) sequence
- of syllables.
Scroll, codex, and linearity
Scrolls are sequential access devices.
Codices are random access or
direct access devices.
Service books, missals.
Printing
The second most important technology.
Woodblocks
7c China (for Buddhist scriptures);
15c Europe (block books).
Mixture of text and image.
Movable type
11c China (baked clay)
13c Korea (metal)
15c Europe (metal)
What characters to cast into type?
Industrial printing
Post-incunabular books develop
secondary
non-linear properties:
- title page
- front, back matter
- running heads
- catch words
- page numbers
- signature numbers
Envelope-pushing
Both before and after print,
some books go beyond
reproduction of language.
- initials, miniatures
- indexing (12c, first Biblical concordances)
- image poetry
Facsimiles
A logical continuation of the woodblock line:
an analog representation.
Not: text and image,
but text as image.
Codes
In which we speed through history.
Semaphores
Semaphore telegraph / Napoleonic semaphore:
- line-of-sight communication
- angled rods (or shuttered panels)
- one position, one symbol
(in the French case, two arms × seven positions
× one crossbar × 4 angles:
7 × 7 × 4 = 196 symbols;
two-symbol codes using 92 positions give
92 × 92 = 8,464 meanings)
- code books
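The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the counts come from the slide, and the function and constant names are made up for illustration.

```python
# Counts are taken from the slide; all names here are illustrative.
ARM_POSITIONS = 7      # positions available to each of the two arms
CROSSBAR_ANGLES = 4    # angles available to the crossbar


def signal_space(arm_positions=ARM_POSITIONS, crossbar_angles=CROSSBAR_ANGLES):
    """Distinct configurations: first arm x second arm x crossbar."""
    return arm_positions * arm_positions * crossbar_angles


def two_symbol_meanings(positions_in_use=92):
    """Meanings addressable by a pair of symbols drawn from the positions in use."""
    return positions_in_use * positions_in_use


print(signal_space())         # 196
print(two_symbol_meanings())  # 8464
```

The second figure is what makes code books necessary: a pair of symbols selects an entry in a book rather than spelling out a word.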
Flag semaphores
Also 19th century (?)
- A-Z
- 1-9 (= A-I), 0 (= K)
- rest / space
- error / attention
- invitation to transmit (= K)
- acknowledge / correct (= C)
- shift-to-numbers
- shift-to-letters (= J)
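The shift signals mean that the same arm position is read differently depending on mode. A minimal sketch of an encoder follows, assuming the signal inventory listed above; the signal names ("REST", "SHIFT-TO-NUMBERS") and the function name are invented for the example, and the choice not to reset the mode at a space is an assumption.

```python
import string

# Digits reuse letter positions: 1-9 are A-I, 0 is K (per the slide).
DIGIT_TO_LETTER = {str(d): string.ascii_uppercase[d - 1] for d in range(1, 10)}
DIGIT_TO_LETTER["0"] = "K"


def to_signals(text):
    """Turn text into a list of flag positions, inserting shift signals."""
    signals = []
    in_numbers = False
    for ch in text.upper():
        if ch in DIGIT_TO_LETTER:
            if not in_numbers:
                signals.append("SHIFT-TO-NUMBERS")
                in_numbers = True
            signals.append(DIGIT_TO_LETTER[ch])
        elif ch in string.ascii_uppercase:
            if in_numbers:
                signals.append("J")  # shift-to-letters shares the J position
                in_numbers = False
            signals.append(ch)
        elif ch == " ":
            signals.append("REST")
        else:
            raise ValueError(f"no flag signal for {ch!r}")
    return signals


print(to_signals("room 10"))
```

Note what the receiver has to track: the letter A after a shift-to-numbers means 1, so the document model already includes a small amount of state.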
Morse code
1836-1844, Morse et al.
- A-Z
- 0-9
- gaps (1, 3, 7 dots long)
- spelled-out punctuation, or
punctuation conventions
  - full stop (AAA)
  - comma (MIM)
  - etc. (. , ? ' / ( ) : ; = + - " @)
- prosigns (message control): wait, invitation to transmit, error,
end of work, understood, starting signal
The document model of Morse code
For Morse code (and flag semaphore), a document is
a sequence
of characters
and gaps.
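This model is small enough to write down directly. The sketch below represents a message as a flat list of characters and gaps, using the 3-dot and 7-dot gaps mentioned earlier; the tuple representation, the function names, and the partial code table are choices made for the example. The dash length of three dots is the usual convention, not something stated on the slides.

```python
# Partial table, enough for the example; a full table would cover A-Z and 0-9.
MORSE = {"A": ".-", "E": ".", "I": "..", "M": "--",
         "N": "-.", "O": "---", "S": "...", "T": "-"}


def as_document(text):
    """Model a message as a flat sequence of characters and gaps."""
    doc = []
    for word_index, word in enumerate(text.upper().split()):
        if word_index:
            doc.append(("gap", 7))        # gap between words
        for char_index, ch in enumerate(word):
            if char_index:
                doc.append(("gap", 3))    # gap between characters
            doc.append(("char", ch))
    return doc


def duration_in_dots(doc):
    """Transmission length, assuming dot = 1, dash = 3, 1-dot gap inside a character."""
    total = 0
    for kind, value in doc:
        if kind == "gap":
            total += value
        else:
            code = MORSE[value]
            total += sum(1 if mark == "." else 3 for mark in code) + len(code) - 1
    return total


print(as_document("et a"))
print(duration_in_dots(as_document("sos")))  # 27
```

Nothing in the list records lines, pages, or emphasis: whatever the sender wants to say about structure has to be spelled out in characters.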
Punched cards
In which we arrive on the threshold of electronic computing.
Hollerith cards
1890 census: Herman Hollerith.
Unit record equipment:
- card punch
- collator
- sorter
- tabulator
- interpreter
Punched-card (‘Hollerith’) code
In standard design (IBM, 1928), a card has:
- 80 columns
- rectangular holes
- ten rows for digit punch
- two rows for zone punch
    ______________________________________________
   /&-0123456789ABCDEFGHIJKLMNOPQR/STUVWXYZ
Y / x           xxxxxxxxx
X|   x                   xxxxxxxxx
0|    x                           xxxxxxxxx
1|     x        x        x        x
2|      x        x        x        x
3|       x        x        x        x
4|        x        x        x        x
5|         x        x        x        x
6|          x        x        x        x
7|           x        x        x        x
8|            x        x        x        x
9|             x        x        x        x
 |________________________________________________
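The pattern in the diagram (one zone punch plus one digit punch per letter) can be sketched as a lookup. This is an illustration only, using the row labels Y, X, 0-9 from the diagram and covering just the characters shown there; the function name is made up.

```python
def punches(ch):
    """Rows punched for one character, using the diagram's row labels."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if ch == "&":
        return ["Y"]
    if ch == "-":
        return ["X"]
    if ch in "0123456789":
        return [ch]
    if ch == "/":
        return ["0", "1"]
    # Letters: a zone punch (Y, X, or 0) combined with a digit punch.
    for zone, letters, first_digit in (("Y", "ABCDEFGHI", 1),
                                       ("X", "JKLMNOPQR", 1),
                                       ("0", "STUVWXYZ", 2)):
        if ch in letters:
            return [zone, str(letters.index(ch) + first_digit)]
    raise ValueError(f"{ch!r} is not in the character set shown")


for ch in "A", "R", "S", "/", "7":
    print(ch, punches(ch))
```

Row 0 does double duty: alone it is the digit zero, and combined with a digit punch it acts as the third zone.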
Punched-card model
In early uses, card divided into fields.
- columns 1-20: name
- column 21: sex
- columns 23-24: age
- ...
Cf. simple database records.
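Reading such a card amounts to slicing a fixed-width record. A minimal sketch, assuming the field layout listed above; the sample card and all names are invented for illustration.

```python
# Field layout from the slide, as 1-based inclusive column ranges.
FIELDS = {"name": (1, 20), "sex": (21, 21), "age": (23, 24)}


def read_card(card):
    """Split one 80-column card image into named fields."""
    card = card.ljust(80)
    return {field: card[first - 1:last].strip()
            for field, (first, last) in FIELDS.items()}


# A made-up card: name in columns 1-20, sex in 21, column 22 blank, age in 23-24.
sample = "DOE JANE".ljust(20) + "F" + " " + "42"
print(read_card(sample))
```

The model lives entirely in the column positions: nothing on the card itself says where one field ends and the next begins.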
Horizontal text
One way to represent texts with cards:
- columns 1-72: one line of text
- columns 73-80: line ID (play, act, scene, line;
book, chapter, verse; ...)
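A sketch of this layout in Python follows. The 72/8 column split comes from the slide; the line-ID scheme ("HAM3.1.1") and the function names are hypothetical.

```python
TEXT_WIDTH = 72   # columns 1-72
ID_WIDTH = 8      # columns 73-80


def to_card(text, line_id):
    """Build one 80-column card image from a line of text and its ID."""
    if len(text) > TEXT_WIDTH or len(line_id) > ID_WIDTH:
        raise ValueError("text or line ID does not fit on one card")
    return text.ljust(TEXT_WIDTH) + line_id.ljust(ID_WIDTH)


def from_card(card):
    """Recover (text, line_id) from a card image."""
    return card[:TEXT_WIDTH].rstrip(), card[TEXT_WIDTH:TEXT_WIDTH + ID_WIDTH].strip()


card = to_card("To be, or not to be, that is the question:", "HAM3.1.1")
print(len(card))        # 80
print(from_card(card))
```

The line ID is what keeps the text recoverable if the deck is dropped: each card carries its own position in the work.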
Further on DTDs
Some constructs you'll see.
Notation declarations
<!NOTATION
name
external identifier
>
Examples (in Oxygen)
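Outside Oxygen, one way to see notation declarations is to let a parser report them. The sketch below uses Python's standard expat binding and its NotationDeclHandler callback; the sample document, the notation names, and both identifiers are invented for illustration.

```python
import xml.parsers.expat

# A made-up document whose internal subset declares two notations:
# one with a system identifier, one with only a public identifier.
SAMPLE = """<!DOCTYPE doc [
  <!ELEMENT doc EMPTY>
  <!NOTATION png SYSTEM "http://example.org/notations/png">
  <!NOTATION tex PUBLIC "-//Example//NOTATION TeX source//EN">
]>
<doc/>"""


def collect_notations(xml_text):
    """Return (name, system_id, public_id) for each notation declaration."""
    found = []

    def on_notation(name, base, system_id, public_id):
        found.append((name, system_id, public_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.Parse(xml_text, True)
    return found


for declaration in collect_notations(SAMPLE):
    print(declaration)
```

The parser does nothing with the identifiers beyond passing them along; what a notation means is left to the application.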
Parameter entity declarations
<!ENTITY
%
name
entity value (a quoted literal)
or external identifier
>
Examples (in Oxygen)
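The effect of a parameter entity can be imitated as text substitution inside the DTD. The following is a toy sketch, not how a real parser is implemented: it ignores details such as the spaces a parser adds around replacement text and the restrictions on where references may appear. The entity name phrase appears later in these slides; its replacement text and the function name are made up.

```python
import re

REFERENCE = re.compile(r"%([A-Za-z_][\w.-]*);")


def expand_parameter_entities(dtd_text, entities, max_passes=10):
    """Replace %name; references by their replacement text, repeatedly."""
    for _ in range(max_passes):
        expanded = REFERENCE.sub(lambda m: entities[m.group(1)], dtd_text)
        if expanded == dtd_text:
            break
        dtd_text = expanded
    return dtd_text


# Stands in for a declaration such as: <!ENTITY % phrase "emph | term">
entities = {"phrase": "emph | term"}
declaration = "<!ELEMENT list-item (#PCDATA | %phrase;)*>"
print(expand_parameter_entities(declaration, entities))
```

Changing the one entity declaration changes every content model that refers to it, which is why DTDs like the TEI's lean on parameter entities so heavily.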
What got left out of XML DTDs
Several SGML DTD constructs
support features omitted from XML.
- SHORTREF (<!SHORTREF foo "bar" baz>,
<!USEMAP foo barracuda>)
- RANK
- and-connector (&)
- case folding of element, attribute names
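To get a feel for what the loss of the and-connector costs, here is a sketch that spells out an SGML group such as (a & b) using only the sequence and choice connectors that XML keeps. The function name and the element names are illustrative.

```python
from itertools import permutations


def expand_and_group(names):
    """Write 'all of these, in any order' as a choice among sequences."""
    alternatives = ["(" + ", ".join(order) + ")" for order in permutations(names)]
    return "(" + " | ".join(alternatives) + ")"


print(expand_and_group(["a", "b"]))        # ((a, b) | (b, a))
print(expand_and_group(["a", "b", "c"]))
```

The number of alternatives grows factorially. For three or more names the naive result would also need refactoring before use, since several alternatives begin with the same element and XML requires deterministic content models.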
What got trimmed out of XML DTDs
Several SGML DTD constructs were trimmed down in XML.
- default entities (entity name #DEFAULT)
- tag omissibility
(<!ELEMENT list-item - O (#PCDATA | %phrase;)*>)
- attribute types (NUMBER, NUMBERS, NUTOKEN, NUTOKENS)
- inclusion and exclusion exceptions
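As a rough illustration of what trimming means for tag omissibility, the sketch below removes the two omission indicators from an SGML element declaration, leaving the form XML accepts. It is a simplification that assumes a single element name (no name groups) and does nothing about exceptions or the SGML-only attribute types; the function name is made up.

```python
import re

# Element name, then the start-tag and end-tag omission indicators ('-' or 'O').
OMISSION = re.compile(r"(<!ELEMENT\s+\S+)\s+[-Oo]\s+[-Oo](?=\s)")


def drop_tag_omission(declaration):
    """Remove SGML tag-omission indicators from an element declaration."""
    return OMISSION.sub(r"\1", declaration)


sgml = "<!ELEMENT list-item - O (#PCDATA | %phrase;)*>"
print(drop_tag_omission(sgml))
```

The declaration is now legal in XML, but the documents are not necessarily: any instance that relied on omitting the end-tag has to have it put back.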
Assignments
Due: Sunday morning 30 September 2012.
1 Do a TEI version of your document, valid against
some schema or DTD built by the TEI's Roma software.
(If you want to keep it simple, use TEI Lite.)
2 Questions about ISO 8879 Annex E.
3 Simple DTD syntax quiz.