What does descriptive markup contribute
to digital humanities?
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
26 October 2015
http://blackmesatech.com/2015/10/KIaCiDH/
Introduction
Basic decisions about the information or substantive value
of any document rendered in a digital substrate ... are fraught
with theoretical implications.
-Johanna Drucker
Let me write a nation's data structures, and I care not who writes its code.
-Richard Ristow
The elevator pitch
The technology of descriptive markup
- is not value-free or value-neutral;
- offers a compelling account of the nature of text;
- provides a helpful foundation for better software;
- leaves room for further development.
Theses
By descriptive markup I mean (broadly) a
set of practices and the network of ideas they embody.
N.B. descriptive markup ≠ SGML ≠ XML.
Documents have structure
(1) Documents have structure worth exposing.
The kind of structure varies with the kind of document.
But: books, cantos, lines; acts, scenes, speeches, lines;
treaty, clause; ...
Surprisingly, this was historically non-obvious.
Predefined primitives
(2) No predefined set of primitive notions will be adequate for a
general-purpose document representation language.
Therefore users must be allowed
to define their own sets of basic notions (in XML:
their own element types, attributes).
Formally, SGML and XML define metalanguages;
it is users who define the actual markup languages like HTML
or TEI.
Longevity
(3) Documents can be made reusable / given longer life by representing them in a format
which is
- application-independent
- vendor-neutral
(4) Documents will be most useful when their format specifies
not how different parts of the document should be processed,
but what they are.
(5) In general, best results are obtained when document markup is
declarative not imperative, and descriptive or logical,
not tied to display, layout, or any other
form of processing.
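A minimal sketch of theses (4) and (5), assuming a made-up vocabulary (poem, title, line, emph are hypothetical names, not drawn from any real tag set). The markup says only what each part is; two unrelated applications then decide for themselves how to process it.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names, chosen only for illustration.
DOC = """<poem>
  <title>Sample</title>
  <line>The <emph>first</emph> line</line>
  <line>The second line</line>
</poem>"""

def render_plain(poem):
    """Application 1: plain-text display, with emphasis shown as *...*."""
    out = [poem.findtext("title").upper()]
    for line in poem.iter("line"):
        parts = [line.text or ""]
        for child in line:
            text = child.text or ""
            parts.append("*" + text + "*" if child.tag == "emph" else text)
            parts.append(child.tail or "")
        out.append("".join(parts))
    return "\n".join(out)

def count_lines(poem):
    """Application 2: a simple analysis that ignores display entirely."""
    return sum(1 for _ in poem.iter("line"))

root = ET.fromstring(DOC)
print(render_plain(root))
print(count_lines(root))
```

Neither function is privileged by the markup: the document never says "italicize this", only "this is emphasized".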
Some consequences
It follows more or less immediately that:
Application-independence seems to
require open standards.
Reusability improves when different things are
represented differently, similar things similarly.
What things are similar, how?
This leads us directly into ontology.
Declarative semantics make it possible to
reason about representations; imperative semantics impede such reasoning.
N.B. SGML and XML were defined by people
who wanted to make reusable documents using declarative descriptive semantics;
they enable these practices but cannot by themselves enforce them.
Validation
(7) Application-independence requires an independent definition of
correctness.
Otherwise, correct input means
‘whatever the program accepts’. (Cf. HTML,
Word, RTF, comma-delimited format, ...)
Checking correctness entails validation.
(N.B. word usage varies.)
Document grammars
SGML allows (requires) the user to define
the document type to be processed.
A DTD is essentially a context-free grammar.
Element structure is essentially a parse tree for that grammar.
XML is essentially a labeled-bracketing serialization of that tree.
This intertwining of grammar, tree, XML (validation method,
data structure, serialization format) is very powerful.
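The relation between grammar and tree can be sketched in a few lines. This is not a DTD processor; it is a toy validator under stated assumptions: the grammar below is hypothetical, each content model is written as a regular expression over child element names, and text content and attributes are not checked.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document grammar for a play. Each element type maps to a
# content model: a regular expression over child element names, each name
# followed by one space. None means "no child elements allowed".
GRAMMAR = {
    "play": r"(act )+",
    "act": r"(scene )+",
    "scene": r"(speech )+",
    "speech": r"speaker (line )+",
    "speaker": None,
    "line": None,
}

def validate(elem, grammar=GRAMMAR):
    """Return a list of error messages; an empty list means the tree is valid."""
    if elem.tag not in grammar:
        return ["undeclared element type: " + elem.tag]
    errors = []
    model = grammar[elem.tag]
    children = "".join(child.tag + " " for child in elem)
    if model is None:
        if children:
            errors.append(elem.tag + ": no child elements allowed")
    elif not re.fullmatch(model, children):
        errors.append(elem.tag + ": children [" + children.strip() + "] do not match " + model)
    # Check each subtree against the same grammar.
    for child in elem:
        errors.extend(validate(child, grammar))
    return errors

good = ET.fromstring(
    "<play><act><scene><speech><speaker>A</speaker><line>x</line></speech></scene></act></play>")
bad = ET.fromstring(
    "<play><act><scene><speech><line>x</line></speech></scene></act></play>")
print(validate(good))
print(validate(bad))
```

The point of the sketch is that correctness is defined by the grammar, not by whatever a particular program happens to accept.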
An objection
Trees
Many have objected that
SGML (and XML) impose a hierarchical structure on texts.
For example, Johanna Drucker:
The requirements of XML were such that only a single hierarchy
could be imposed on (actually inserted into) a document. This meant
that scholars migrating materials into electronic form frequently faced
the problem of choosing between categories or types of information to
be tagged. ... Did one chunk a text into chapters or into pages? One
could not do both, since one chapter might end and another begin on
the same page, in which case the two systems would conflict with each
other.
True or false?
Character sequences, trees, graphs
(6) Documents are more than sequences of characters.
Are SGML and XML documents always trees?
Many say so.
I think they oversimplify.
SGML and XML documents are graphs
The element structures of SGML / XML documents form trees.
But SGML and XML also define ID/IDREF relations.
These can link any
two (or more, or fewer) elements in the document.
So: SGML and XML documents are directed graphs.
(Not necessarily acyclic.)
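A small sketch of the graph reading, assuming a hypothetical vocabulary in which the attributes id and ref stand in for attributes a DTD would declare as ID and IDREF. Parent-child arcs and reference arcs are combined into one directed graph, which is then checked for cycles.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two notes that point at each other.
DOC = """<doc>
  <note id="n1" ref="n2">See the second note.</note>
  <note id="n2" ref="n1">See the first note.</note>
</doc>"""

def build_graph(root):
    """Return {element: [successors]} combining tree arcs and IDREF arcs."""
    by_id = {e.get("id"): e for e in root.iter() if e.get("id")}
    graph = {}
    for elem in root.iter():
        arcs = list(elem)                    # tree arcs: parent -> child
        target = by_id.get(elem.get("ref"))
        if target is not None:
            arcs.append(target)              # reference arc: element -> element
        graph[elem] = arcs
    return graph

def has_cycle(graph):
    """Depth-first search; meeting a node still being visited means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

print(has_cycle(build_graph(ET.fromstring(DOC))))
print(has_cycle(build_graph(ET.fromstring("<doc><a/><b/></doc>"))))
```

With the reference arcs removed, the same document is an ordinary tree and the cycle check fails, as expected.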
Other data structures
We can also use SGML or XML for other data structures:
- ignore parts of the document (projection)
- interpret arcs appropriately
- restructure as needed
SGML and XML have several models
The tree and graph accounts of SGML / XML both oversimplify:
SGML and XML have several models (cf. RDBMS).
What does handle mean?
What does it mean to say that XML can or cannot
“handle” multiple hierarchies?
Proposal: it means it can or cannot usefully
represent them.
Representation
A given representation of data is correct
within specified limits
iff:
- Each property and value we distinguish in the data can
be distinguished in the representation.
- For each operation we wish to perform on the data,
there is a corresponding operation the machine
can perform on the representation, which produces
an appropriate result.
Properties and operations on trees
For multiple hierarchies (trees), this means:
- Type and attributes of node.
- Given parent, identify children (and vice versa).
- Given object, identify siblings.
What operations?
- Create / update / delete subtree.
- Delete node and promote children.
- Insert new node above specified children.
- ...
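Two of the operations listed above, sketched on Python's standard ElementTree. The function names unwrap and wrap are hypothetical, and the sketch ignores character data directly inside the deleted node (its text and tail), which a fuller version would have to preserve.

```python
import xml.etree.ElementTree as ET

def unwrap(parent, node):
    """Delete `node` (a child of `parent`) and promote its children in place."""
    index = list(parent).index(node)
    parent.remove(node)
    for offset, child in enumerate(list(node)):
        parent.insert(index + offset, child)

def wrap(parent, children, tag):
    """Insert a new `tag` element above the given adjacent children of `parent`."""
    index = list(parent).index(children[0])
    wrapper = ET.Element(tag)
    for child in children:
        parent.remove(child)
        wrapper.append(child)
    parent.insert(index, wrapper)
    return wrapper

root = ET.fromstring("<act><scene><speech/><speech/></scene></act>")
unwrap(root, root[0])               # the scene disappears, speeches move up
print([c.tag for c in root])
wrap(root, list(root), "scene")     # a new scene is placed above the speeches
print([c.tag for c in root], [c.tag for c in root[0]])
```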
Representing hierarchies
Among the proposals for representation of multiple hierarchies
in XML:
- CONCUR (ISO 8879), XCONCUR (Schonefeld).
- Standoff markup (Thompson, many others).
- Virtual elements (TEI).
- Milestone elements (TEI and folklore).
- Trojan Horse markup (DeRose, generalization of milestones).
Each clearly meets the standard of correctness.
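As an illustration of the milestone approach applied to the chapters-versus-pages case, here is a sketch under simplifying assumptions: the sample document is invented, page boundaries are empty pb elements (named in the manner of TEI), every paragraph sits inside a chapter, and no page break falls inside a paragraph.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: chapters are ordinary elements, pages are milestones.
# Page 2 holds the end of chapter 1 and the start of chapter 2.
DOC = ('<book><pb n="1"/>'
       '<chapter n="1"><p>Alpha.</p><pb n="2"/><p>Beta.</p></chapter>'
       '<chapter n="2"><p>Gamma.</p><pb n="3"/><p>Delta.</p></chapter>'
       '</book>')

def pages(root):
    """Group paragraphs by page: {page number: [(chapter number, text), ...]}."""
    result = {}
    current_page = None
    current_chapter = None
    # iter() walks in document order, so each pb changes the current page.
    for elem in root.iter():
        if elem.tag == "pb":
            current_page = elem.get("n")
            result.setdefault(current_page, [])
        elif elem.tag == "chapter":
            current_chapter = elem.get("n")
        elif elem.tag == "p":
            result.setdefault(current_page, []).append((current_chapter, elem.text))
    return result

root = ET.fromstring(DOC)
print([c.get("n") for c in root.findall("chapter")])   # the chapter hierarchy
print(pages(root))                                      # the page hierarchy
```

Both hierarchies are recoverable from the one document: the chapter view comes directly from the element structure, the page view from a single pass over the milestones.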
Bias
Perhaps XML can ‘handle’
multiple hierarchies, but is biased against them?
The choice
Assume we have numerous nesting structures we must
work on. We can:
- use XML
- invent an alternative
Which will produce better results?
Ontological commitments
N.B. SGML and XML are useful even if
texts are not solely made up of single hierarchies.
(The OHCO hypothesis is sufficient for SGML / XML, but not necessary.)
The argument against SGML and XML seems to require
that texts not be made up of or not exhibit any
nesting structures at all.
What now?
How do we move forward from here?
Perhaps permanent?
Perhaps descriptive markup is here to stay?
Surely these ideas are motherhood and apple pie?
- Texts have structure.
- User self-determination.
- Vendor-independence.
- Declarative descriptive semantics.
- Validation.
Permanently precarious?
Perhaps descriptive markup is only a transitory achievement.
- New universal vocabularies abound.
- Vendors never object to lock-in.
- For new apps, imperative semantics are easier and simpler.
- Openness, application-independence, vendor-independence,
and documentation are a lot of trouble.
Forward?
How do we move forward?
Can we find better ways to interact with programming languages?
(a) better APIs? (b) programming-language friendlier markup?
(c) better programming languages?
Can we find better ways to document tag sets?
Can we find better data structures?
(character ranges, Goddags, Rhine Delta graphs, other graphs)
Can we find better validation mechanisms?
(Predicate-based validation, rabbit/duck grammars, Creole, graph
grammars)
Can we use markup better in our real work?
Thank you
Thank you.
Any questions?