What does descriptive markup contribute
to digital humanities?
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
26 October 2015
Basic decisions about the information or substantive value
of any document rendered in a digital substrate ... are fraught
with theoretical implications.
Let me write a nation's data structures, and I care not who writes its code.
The elevator pitch
The technology of descriptive markup
- is not value-free or value-neutral;
- offers a compelling account of the nature of text;
- provides a helpful foundation for better software;
- leaves room for further development.
By descriptive markup I mean (broadly) a
set of practices and the network of ideas they embody.
N.B. descriptive markup ≠ SGML ≠ XML.
Documents have structure
(1) Documents have structure worth exposing.
The kind of structure varies with the kind of document.
But: books, cantos, lines; acts, scenes, speeches, lines;
treaty, clause; ...
Surprisingly, this was historically non-obvious.
(2) No predefined set of primitive notions will be adequate for a
general-purpose document representation language.
Therefore users must be allowed
to define their own sets of basic notions (in XML:
their own element types, attributes).
Formally, SGML and XML define metalanguages;
it is users who define the actual markup languages, like HTML.
(3) Documents can be made reusable / given longer life by representing them in a format
(4) Documents will be most useful when their format specifies
not how different parts of the document should be processed,
but what they are.
(5) In general, best results are obtained when document markup is
declarative not imperative, and descriptive or logical,
not tied to display, layout, or any other
form of processing.
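A minimal sketch of points (4) and (5), assuming a made-up vocabulary (scene, speech, speaker, line) and invented content. The markup says only what each part is; two unrelated processes, neither anticipated by the markup, then work from the same document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical descriptive markup: the tags name what each part is,
# not how it should be displayed or processed.
SCENE = """
<scene>
  <speech><speaker>Anna</speaker><line>first line</line><line>second line</line></speech>
  <speech><speaker>Ben</speaker><line>third line</line></speech>
</scene>
"""

def render_text(scene):
    """One possible process: a plain-text rendering."""
    out = []
    for speech in scene.iter("speech"):
        out.append(speech.findtext("speaker").upper())
        out.extend("  " + line.text for line in speech.iter("line"))
    return "\n".join(out)

def lines_per_speaker(scene):
    """Another process on the same markup: a simple count."""
    counts = Counter()
    for speech in scene.iter("speech"):
        counts[speech.findtext("speaker")] += len(speech.findall("line"))
    return counts

scene = ET.fromstring(SCENE)
print(render_text(scene))
print(dict(lines_per_speaker(scene)))
```

Had the document recorded "bold, centered" instead of "speaker", the second process would have had nothing reliable to count.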
It follows more or less immediately that:
Application-independence seems to
require open standards.
Reusability improves when different things are
represented differently, similar things similarly.
What things are similar, how?
This leads us directly into ontology.
Declarative semantics make it possible to
reason about representations; imperative semantics impede.
N.B. SGML and XML were defined by people
who wanted to make reusable documents using declarative descriptive semantics;
they enable these practices but cannot by themselves enforce them.
(7) Application-independence requires an independent definition of correctness.
Otherwise, correct input means
‘whatever the program accepts’. (Cf. HTML,
Word, RTF, comma-delimited format, ...)
Checking correctness entails validation.
(N.B. word usage varies.)
SGML allows (requires) the user to define
the document type to be processed.
DTD is essentially a context-free grammar.
Element structure is essentially a parse tree for that grammar.
XML is essentially a labeled-bracketing serialization of that tree.
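The three roles can be sketched in a few lines. This is an illustration under stated assumptions, not a DTD processor: the document type below is hypothetical, each content model is written as a regular expression over child element names, and character data is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document type. Each content model is a regular expression
# over the names of an element's children (each name followed by a space).
CONTENT_MODELS = {
    "play": r"(act )+",
    "act": r"(scene )+",
    "scene": r"(speech )+",
    "speech": r"speaker (line )+",
    "speaker": r"",
    "line": r"",
}

def is_valid(element):
    """Grammar as validation method: check every node against its rule."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in element)

def brackets(element):
    """Tree as labeled bracketing (character data omitted)."""
    inner = " ".join(brackets(child) for child in element)
    return "[" + element.tag + (" " + inner if inner else "") + "]"

GOOD = ("<play><act><scene><speech><speaker>A</speaker>"
        "<line>x</line></speech></scene></act></play>")
BAD = "<play><act><scene><speech><line>x</line></speech></scene></act></play>"

good, bad = ET.fromstring(GOOD), ET.fromstring(BAD)
print(is_valid(good), is_valid(bad))
print(brackets(good))
```

The same tree serves as the parse tree being checked, the data structure being traversed, and the thing being serialized.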
This intertwining of grammar, tree, XML (validation method,
data structure, serialization format) very powerful.
Many have objected that
SGML (and XML) impose a hierarchical structure on texts.
For example, Johanna Drucker:
The requirements of XML were such that only a single hierarchy
could be imposed on (actually inserted into) a document. This meant
that scholars migrating materials into electronic form frequently faced
the problem of choosing between categories or types of information to
be tagged. ... Did one chunk a text into chapters or into pages? One
could not do both, since one chapter might end and another begin on
the same page, in which case the two systems would conflict with each other.
True or false?
Character sequences, trees, graphs
(6) Documents are more than sequences of characters.
Are SGML and XML documents always trees?
Many say so.
I think they oversimplify.
SGML and XML documents are graphs
The element structures of SGML / XML documents form trees.
But SGML and XML also define ID/IDREF relations.
These can link any
two (or more, or fewer) elements in the document.
So: SGML and XML documents are directed graphs.
(Not necessarily acyclic.)
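A small sketch of the claim, assuming (since no DTD is read here) that an attribute named id is of type ID and a hypothetical attribute named ref is of type IDREF. Element containment supplies the tree arcs; the references supply further arcs, and here they form a cycle.

```python
import xml.etree.ElementTree as ET

# Assumption: 'id' plays the role of an ID attribute and 'ref' the role
# of an IDREF attribute; a real DTD or schema would declare this.
DOC = """
<doc>
  <note id="n1" ref="n2">see the next note</note>
  <note id="n2" ref="n1">see the previous note</note>
  <para id="p1" ref="n1">text</para>
</doc>
"""

def build_graph(root):
    """Map each element to its successors: children plus IDREF targets."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    edges = {el: [] for el in root.iter()}
    for el in root.iter():
        edges[el].extend(el)              # tree arcs: parent -> child
        target = by_id.get(el.get("ref"))
        if target is not None:
            edges[el].append(target)      # pointer arcs: IDREF -> ID
    return edges

def has_cycle(edges):
    """Depth-first search; an arc back to a node in progress is a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in edges}

    def visit(node):
        colour[node] = GREY
        for nxt in edges[node]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in edges)

with_refs = build_graph(ET.fromstring(DOC))
tree_only = build_graph(ET.fromstring("<doc><note/><para/></doc>"))
print(has_cycle(with_refs), has_cycle(tree_only))
```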
Other data structures
We can also use SGML or XML for other data structures:
- ignore parts of the document (projection)
- interpret arcs appropriately
- restructure as needed
SGML and XML have several models
Tree and graph accounts of SGML / XML both oversimplify:
SGML and XML have several models (cf. RDBMS).
What does handle mean?
What does it mean to say that XML can or cannot
“handle” multiple hierarchies?
Proposal: it means it can or cannot usefully
A given representation of data is correct
within specified limits if:
- Each property and value we distinguish in the data can
be distinguished in the representation.
- For each operation we wish to perform on the data,
there is a corresponding operation the machine
can perform on the representation, which produces
an appropriate result.
Properties and operations on trees
For multiple hierarchies (trees), this means:
- Type and attributes of node.
- Given parent, identify children (and vice versa).
- Given object, identify siblings.
- Create / update / delete subtree.
- Delete node and promote children.
- Insert new node above specified children.
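For a single tree, the operations listed above might look as follows in Python's standard xml.etree.ElementTree. This is a sketch, not a proposal: the helper names are hypothetical, ElementTree keeps no parent pointers (so a parent map is built first), and text and tail content are ignored for brevity.

```python
import xml.etree.ElementTree as ET

def parent_map(root):
    """Given a child, identify its parent."""
    return {child: parent for parent in root.iter() for child in parent}

def siblings(node, parents):
    """Given an object, identify its siblings."""
    parent = parents.get(node)
    return [] if parent is None else [c for c in parent if c is not node]

def delete_and_promote(node, parents):
    """Delete a node and put its children in its place."""
    parent = parents[node]
    index = list(parent).index(node)
    children = list(node)
    parent.remove(node)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)

def wrap_children(parent, first, last, tag):
    """Insert a new node above the children parent[first:last]."""
    group = list(parent)[first:last]
    wrapper = ET.Element(tag)
    for child in group:
        parent.remove(child)
        wrapper.append(child)
    parent.insert(first, wrapper)
    return wrapper

root = ET.fromstring(
    "<poem><stanza><l n='1'/><l n='2'/></stanza><l n='3'/></poem>")
parents = parent_map(root)
stanza = root[0]
sibling_tags = [s.tag for s in siblings(stanza, parents)]

delete_and_promote(stanza, parents)      # poem now holds l1, l2, l3
flat = [l.get("n") for l in root]

wrap_children(root, 1, 3, "couplet")     # poem now holds l1, couplet(l2, l3)
print(ET.tostring(root, encoding="unicode"))
```

The parent map is stale after each structural change and would need rebuilding before further use.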
Among the proposals for representation of multiple hierarchies
- CONCUR (ISO 8879), XCONCUR (Schonefeld).
- Standoff markup (Thompson, many others).
- Virtual elements (TEI).
- Milestone elements (TEI and folklore).
- Trojan Horse markup (DeRose, generalization of milestones).
Each clearly meets the standard of correctness.
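The milestone approach can be sketched against the chapters-versus-pages case. The assumptions are loud ones: chapters are marked as elements, page breaks as empty pb elements (the name follows TEI practice), page breaks fall only between paragraphs, and the content is invented. Both hierarchies are then recoverable from one document.

```python
import xml.etree.ElementTree as ET

# Chapters are elements; pages are marked by empty milestone elements.
# Page 2 begins in chapter 1 and ends in chapter 2.
DOC = """
<book>
  <chapter n="1"><pb n="1"/><p>a</p><p>b</p><pb n="2"/><p>c</p></chapter>
  <chapter n="2"><p>d</p><pb n="3"/><p>e</p></chapter>
</book>
"""

def by_chapter(root):
    """Read the hierarchy expressed by element nesting."""
    return {ch.get("n"): [p.text for p in ch.iter("p")]
            for ch in root.iter("chapter")}

def by_page(root):
    """Rebuild the second hierarchy from the milestones."""
    pages = {}
    current = None
    for el in root.iter():               # iter() walks in document order
        if el.tag == "pb":
            current = el.get("n")
            pages[current] = []
        elif el.tag == "p" and current is not None:
            pages[current].append(el.text)
    return pages

book = ET.fromstring(DOC)
print(by_chapter(book))
print(by_page(book))
```

Whether this recovery counts as convenient rather than merely possible is the question of bias raised next.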
Perhaps XML can ‘handle’
multiple hierarchies, but is biased against them?
Assume we have numerous nesting structures we must
work on. We can:
- use XML
- invent an alternative
Which will produce better results?
N.B. SGML and XML useful even if
texts are not solely made up of single hierarchies.
(The OHCO hypothesis is sufficient for SGML / XML, but not necessary.)
The argument against SGML and XML seems to require
that texts not be made up of or not exhibit any
nesting structures at all.
How do we move forward from here?
Perhaps descriptive markup is here to stay?
Surely these ideas are motherhood and apple pie?
- Texts have structure.
- User self-determination.
- Declarative descriptive semantics.
Perhaps descriptive markup is only a transitory achievement.
- New universal vocabularies abound.
- Vendors never object to lock-in.
- For new apps, imperative semantics are easier and simpler.
- Openness, application-independence, vendor-independence,
and documentation are a lot of trouble.
How do we move forward?
Can we find better ways to interact with programming languages?
(a) better APIs? (b) programming-language friendlier markup?
(c) better programming languages?
Can we find better ways to document tag sets?
Can we find better data structures?
(character ranges, Goddags, Rhine Delta graphs, other graphs)
Can we find better validation mechanisms?
(Predicate-based validation, rabbit/duck grammars, Creole, graph
Can we use markup better in our real work?