What does descriptive markup contribute to digital humanities?

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

26 October 2015

http://blackmesatech.com/2015/10/KIaCiDH/

Introduction

Basic decisions about the information or substantive value of any document rendered in a digital substrate ... are fraught with theoretical implications.

-Johanna Drucker

Let me write a nation's data structures, and I care not who writes its code.

-Richard Ristow

The elevator pitch

The technology of descriptive markup

is not value-free or value-neutral;
offers a compelling account of the nature of text;
provides a helpful foundation for better software;
leaves room for further development.

Theses

By descriptive markup I mean (broadly) a set of practices and the network of ideas they embody.

N.B. descriptive markup ≠ SGML ≠ XML.

Documents have structure

(1) Documents have structure worth exposing.

The kind of structure varies with the kind of document. But: books, cantos, lines; acts, scenes, speeches, lines; treaty, clause; ...

Surprisingly, this was historically non-obvious.

Predefined primitives

(2) No predefined set of primitive notions will be adequate for a general-purpose document representation language.

Therefore users must be allowed to define their own sets of basic notions (in XML: their own element types, attributes).

Formally, SGML and XML define metalanguages; it is users who define the actual markup languages like HTML or TEI.

Longevity

(3) Documents can be made reusable / given longer life by representing them in a format which is

application-independent
vendor-neutral

(4) Documents will be most useful when their format specifies not how different parts of the document should be processed, but what they are.

(5) In general, best results are obtained when document markup is declarative not imperative, and descriptive or logical, not tied to display, layout, or any other form of processing.
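A minimal sketch of theses (4) and (5), assuming a hypothetical vocabulary (speech, speaker, line) invented for illustration. Because the markup says what each part is rather than how to process it, two unrelated applications can work from the same document without changing it:

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup: the tags name what each part is,
# not how it should be displayed or processed.
SOURCE = (
    "<speech><speaker>Hamlet</speaker>"
    "<line>To be, or not to be</line></speech>"
)


def render_plain(speech):
    """One application: produce a reading text."""
    speaker = speech.findtext("speaker")
    lines = [line.text for line in speech.findall("line")]
    return f"{speaker.upper()}. " + " / ".join(lines)


def count_lines_by_speaker(speech):
    """A different application: count lines per speaker."""
    return {speech.findtext("speaker"): len(speech.findall("line"))}


speech = ET.fromstring(SOURCE)
print(render_plain(speech))
print(count_lines_by_speaker(speech))
```

Had the source said "bold, then a full stop" instead of "speaker", the second application would have had nothing to count.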

Some consequences

It follows more or less immediately that:

Application-independence seems to require open standards.
Reusability improves when different things are represented differently, similar things similarly. (Which things are similar, and how? This leads us directly into ontology.)
Declarative semantics make it possible to reason about representations; imperative semantics impede such reasoning.

N.B. SGML and XML were defined by people who wanted to make reusable documents using declarative descriptive semantics; they enable these practices but cannot by themselves enforce them.

Validation

(7) Application-independence requires an independent definition of correctness.

Otherwise, correct input means ‘whatever the program accepts’. (Cf. HTML, Word, RTF, comma-delimited format, ...)

Checking correctness entails

verification
validation

(N.B. word usage varies.)

Document grammars

SGML allows (requires) the user to define the document type to be processed.

A DTD is essentially a context-free grammar.

Element structure is essentially a parse tree for that grammar.

XML is essentially a labeled-bracketing serialization of that tree.

This intertwining of grammar, tree, and XML (validation method, data structure, serialization format) is very powerful.
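The interplay of grammar and tree can be sketched roughly as follows. The grammar below is a hypothetical stand-in for a DTD, not DTD syntax: each element type gets a regular expression over the names of its children, and the check is independent of any application that later consumes the document. Attributes, character data, and mixed content are ignored in this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document grammar: element type -> regular expression over
# the space-terminated names of its child elements.
GRAMMAR = {
    "play": r"(act )+",
    "act": r"(scene )+",
    "scene": r"(speech )+",
    "speech": r"speaker (line )+",
    "speaker": r"",
    "line": r"",
}


def validate(element):
    """Return a list of error messages; an empty list means valid."""
    model = GRAMMAR.get(element.tag)
    if model is None:
        # Undeclared element types are reported and not descended into.
        return [f"undeclared element type: {element.tag}"]
    children = "".join(child.tag + " " for child in element)
    errors = []
    if not re.fullmatch(model, children):
        errors.append(
            f"<{element.tag}> has children [{children.strip()}], "
            f"expected {model!r}"
        )
    for child in element:
        errors.extend(validate(child))
    return errors


good = ET.fromstring(
    "<play><act><scene><speech><speaker>A</speaker>"
    "<line>x</line></speech></scene></act></play>"
)
bad = ET.fromstring(
    "<play><act><scene><speech><line>x</line></speech></scene></act></play>"
)
print(validate(good))
print(validate(bad))
```

The parsed element tree is the parse tree, and the string handed to the parser is its labeled-bracketing serialization; the grammar decides correctness without reference to what any program happens to accept.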

An objection

Trees

Many have objected that SGML (and XML) impose a hierarchical structure on texts.

For example, Johanna Drucker:

The requirements of XML were such that only a single hierarchy could be imposed on (actually inserted into) a document. This meant that scholars migrating materials into electronic form frequently faced the problem of choosing between categories or types of information to be tagged. ... Did one chunk a text into chapters or into pages? One could not do both, since one chapter might end and another begin on the same page, in which case the two systems would conflict with each other.

True or false?

Character sequences, trees, graphs

(6) Documents are more than sequences of characters.

Are SGML and XML documents always trees?

Many say so.

I think they oversimplify.

SGML and XML documents are graphs

The element structures of SGML / XML documents form trees.

But SGML and XML also define ID/IDREF relations.

These can link any two (or more, or fewer) elements in the document.

So: SGML and XML documents are directed graphs. (Not necessarily acyclic.)
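A small sketch of the point, under stated assumptions: the attribute names target and corresp are treated as IDREF-typed and id as ID-typed by fiat, since ElementTree does not read DTD attribute declarations. The sample document and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """
<doc>
  <note id="n1" target="p2">see below</note>
  <p id="p1" corresp="n1">first</p>
  <p id="p2" corresp="p1">second</p>
</doc>
"""

# Assumption: these attributes would be declared as IDREF in a DTD.
REF_ATTRIBUTES = ("target", "corresp")


def reference_arcs(root):
    """Collect (source id, target id) arcs created by ID/IDREF links."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    arcs = []
    for el in root.iter():
        for name in REF_ATTRIBUTES:
            ref = el.get(name)
            if ref in by_id:
                arcs.append((el.get("id", el.tag), ref))
    return arcs


def has_cycle(arcs):
    """Depth-first search for a cycle among the reference arcs."""
    graph = {}
    for source, target in arcs:
        graph.setdefault(source, []).append(target)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


root = ET.fromstring(DOC)
arcs = reference_arcs(root)
print(arcs)
print(has_cycle(arcs))
```

The element structure of this document is a tree with three leaves; the reference arcs n1 to p2, p2 to p1, and p1 to n1 form a cycle that no tree could contain.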

Other data structures

We can also use SGML or XML for other data structures:

ignore parts of the document (projection)
interpret arcs appropriately
restructure as needed

SGML and XML have several models

The tree and graph accounts of SGML / XML both oversimplify: SGML and XML have several models (cf. RDBMS).

What does handle mean?

What does it mean to say that XML can or cannot “handle” multiple hierarchies?

Proposal: it means it can or cannot usefully represent them.

Representation

A given representation of data is correct within specified limits iff:

Each property and value we distinguish in the data can be distinguished in the representation.
For each operation we wish to perform on the data, there is a corresponding operation the machine can perform on the representation, which produces an appropriate result.

Properties and operations on trees

For multiple hierarchies (trees), this means:

Type and attributes of node.
Given parent, identify children (and vice versa).
Given object, identify siblings.

What operations?

Create / update / delete subtree.
Delete node and promote children.
Insert new node above specified children.
...
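Two of these operations might look like this on one concrete representation, Python's ElementTree. The function names are hypothetical, and the sketch assumes element-only content: any text or tail attached directly to the deleted node is dropped.

```python
import xml.etree.ElementTree as ET


def delete_and_promote(parent, node):
    """Remove node from parent, putting node's children in its place."""
    index = list(parent).index(node)
    parent.remove(node)
    for offset, child in enumerate(list(node)):
        parent.insert(index + offset, child)


def wrap_children(parent, first, last, tag):
    """Insert a new element above the children parent[first:last]."""
    wrapped = list(parent)[first:last]
    wrapper = ET.Element(tag)
    for child in wrapped:
        parent.remove(child)
        wrapper.append(child)
    parent.insert(first, wrapper)
    return wrapper


ORIGINAL = (
    "<scene><speech><line>a</line><line>b</line></speech>"
    "<line>c</line></scene>"
)
scene = ET.fromstring(ORIGINAL)

delete_and_promote(scene, scene[0])
flattened = [el.text for el in scene]
print(flattened)

wrap_children(scene, 0, 2, "speech")
restored = ET.tostring(scene, encoding="unicode")
print(restored)
```

In this example the two operations are inverses: promoting the children of the speech and then wrapping the same two lines again restores the original serialization.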

Representing hierarchies

Among the proposals for representation of multiple hierarchies in XML:

CONCUR (ISO 8879), XCONCUR (Schonefeld).
Standoff markup (Thompson, many others).
Virtual elements (TEI).
Milestone elements (TEI and folklore).
Trojan Horse markup (DeRose, generalization of milestones).

Each clearly meets the standard of correctness.
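As an illustration of the milestone approach applied to the chapters-versus-pages case, here is a sketch in which chapters are ordinary elements and page breaks are empty pb elements (the name is borrowed from TEI; the rest of the vocabulary is hypothetical). It assumes, for simplicity, that page breaks fall between paragraphs rather than inside them.

```python
import xml.etree.ElementTree as ET

# Chapters nest as elements; pages are marked by empty milestone elements.
DOC = (
    "<book>"
    "<chapter n='1'><pb n='1'/><p>a</p><p>b</p><pb n='2'/><p>c</p></chapter>"
    "<chapter n='2'><p>d</p><pb n='3'/><p>e</p></chapter>"
    "</book>"
)


def paragraphs_by_chapter(book):
    """Read the hierarchy that is encoded directly as elements."""
    return {
        chapter.get("n"): [p.text for p in chapter.findall("p")]
        for chapter in book.findall("chapter")
    }


def paragraphs_by_page(book):
    """Recover the second hierarchy from the milestones, in document order."""
    pages = {}
    current = None  # paragraphs before the first pb would be keyed by None
    for el in book.iter():
        if el.tag == "pb":
            current = el.get("n")
            pages.setdefault(current, [])
        elif el.tag == "p":
            pages.setdefault(current, []).append(el.text)
    return pages


book = ET.fromstring(DOC)
by_chapter = paragraphs_by_chapter(book)
by_page = paragraphs_by_page(book)
print(by_chapter)
print(by_page)
```

Page 2 contains the last paragraph of chapter 1 and the first of chapter 2, so the two groupings overlap, yet both can be distinguished in and recovered from the one document.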

Bias

Perhaps XML can ‘handle’ multiple hierarchies, but is biased against them?

The choice

Assume we have numerous nesting structures we must work on. We can:

use XML
invent an alternative

Which will produce better results?

Ontological commitments

N.B. SGML and XML are useful even if texts are not solely made up of single hierarchies.

(The OHCO hypothesis is sufficient for SGML / XML, but not necessary.)

The argument against SGML and XML seems to require that texts neither be made up of nor exhibit any nesting structures at all.

What now?

How do we move forward from here?

Perhaps permanent?

Perhaps descriptive markup is here to stay? Surely these ideas are motherhood and apple pie?

Texts have structure.
User self-determination.
Vendor-independence.
Declarative descriptive semantics.
Validation.

Permanently precarious?

Perhaps descriptive markup is only a transitory achievement.

New universal vocabularies abound.
Vendors never object to lock-in.
For new apps, imperative semantics are easier and simpler.
Openness, application-independence, vendor-independence, and documentation are a lot of trouble.

Forward?

How do we move forward?

Can we find better ways to interact with programming languages? (a) better APIs? (b) markup friendlier to programming languages? (c) better programming languages?
Can we find better ways to document tag sets?
Can we find better data structures? (character ranges, Goddags, Rhine Delta graphs, other graphs)
Can we find better validation mechanisms? (Predicate-based validation, rabbit/duck grammars, Creole, graph grammars)
Can we use markup better in our real work?

Thank you

Thank you.

Any questions?

Acknowledgements

Photo: Detail from Black Mesa, by Marcin Wichary, 9 September 2008 (CC BY 2.0)