Markup semantics and the preservation of intellectual content

The data curator reads Ecclesiastes

C. M. Sperberg-McQueen

Urbana, 20 May 2009

Confessions, disclosures, warnings

radicals and responsibles
IANAL (BSOMBFAL)
texts ≠ books
texts and images
literature, language, text representation, markup, validation

What are we talking about here?

When we talk about curating humanities data, what kinds of data do we mean?

What I'm talking about

When we talk about curating humanities data, what kinds of data do we mean?

editions of works (e.g. Moby Dick, The Electronic Beowulf)
editions of authors (e.g. Collected Works of Thomas Middleton)
archives of an author, movement, theme, with (extensive) ancillary material (e.g. The Blake Archive, Der junge Goethe)
exhaustive corpora of a language and period (Anglo-Saxon Poetic Records, Thesaurus Linguae Graecae, Trésor de la langue française, ...)
dictionaries (e.g. the Diction of Old English or the Dictionary of the Old Spanish Language), with ancillary materials [see corpora])
modern corpora for linguistic work, with annotation (Penn TreeBank, Brown Corpus, Lancaster-Oslo/Bergen corpus, Corpus of London Teenage Language (COLT), British National Corpus (BNC), TIGer (Trees in German), ...)
corpora for historical dictionaries (DOE, DOSL, Thesaurus Linguae Latinae Hiberniae)
art-historical slide collections
3-D reconstructions (e.g. the Chicago Columbian Exposition, the Crystal Palace, ...)
raw and digitally enhanced page images of an important manuscript (e.g. the Beowulf manuscript)
A suite of Fortran programs that generate fugues; a set of fugues written by those programs (a) in the program's output notation, (b) in synthesized sound, (c) as recorded by the school's chamber orchestra
A rhyme database covering verse of a particular author / language / period
A set of dialect maps
Transcriptions of field-linguists' informant interviews
Important programs

N.B. this is not quite the same field as defined by Palmer and Craigin.

SHRDLU: an instructive example

1968-70, MIT: Terry Winograd writes SHRDLU, a milestone in the history of artificial intelligence, natural-language processing, and cognitive science.

The source code is available.

But apparently can't be run today: it's mostly in MicroPLANNER.

Various reconstructions are available.

Which is to be preserved, made accessible, curated?

Books are for use

Books are for use.

-S. R. Ranganathan, Five Laws of Library Science (Madras, 1931)

Does this apply to data?

If not, why not? If so, what are the implications?

What constitutes the book?

What is it that you want to use? What is it you want to preserve for the user?

Typography?

Chemical composition?

Manufacturing process?

Marginalia?

Intellectual content?

Linguistic content? (To wit ... ?)

How long do things last?

How old is the average artefact studied by a humanist?

How old is the average secondary literature consulted by a humanist?

How old is the average software package in use today?

Emulation, migration, normalization

emulation: keep the original data / software, emulate (i.e. fake) the environment?
migration: when the original object is in danger of obsolescence, translate it into a newer equivalent.
normalization: on ingestion, translate the object into a stable non-proprietary form. Cf. ICPSR's (past) use of Osiris.

Which approach, when?

Do digital objects suffer generational degeneration?

Second-generation photocopies are less clear.

Ditto second-generation photographs, copies of drawings, manuscript transcriptions of texts.

Are digital objects free of this degeneration?

Yes, digital objects suffer generational degeneration

Straight copies are usually exempt.

Every format conversion involves a better or worse match; most are potentially lossy.

Every reader [their] book

Every reader [their] book.

-Ranganathan, Law 2

How do you achieve this for books / traditional media?

How could you achieve this for digital materials? Are there things you can do for digital that you can't do for paper?

What would you have to have in order to achieve this for digital materials?

Every book its reader

Every book its reader.

-Ranganathan, Law 3

How do you achieve this for books / traditional media?

How could you achieve this for digital materials?

What would you have to have in order to achieve this for digital materials?

Documentation, documentation, documentation

Metadata.

Inline? Yes, please! Remember the help desk.

Stand-off / external? Yes, please! Remember the catalog manager.

For printed books (non-incunabula) we have metadata both inline and externally. Is this a problem?

Information discovery begins at Google

Within the next five years, all information discovery will begin at Google, including discovery of library materials. The continuing disaggregation of content from its original container will cause a revolution in resource discovery.

-Taiga Forum Provocative Statements, 10 March 2006

Is this true?

Should it be?

Are there alternatives?

Save the time of the reader

Save the time of the reader.

-Ranganathan, Law 4

Does this relate to data curation at all?

Does saving the time of the data curator count, indirectly?

When automation turns bad

From Dubin 2006:

<dublin_core>
  <dcvalue element="contributor" qualifier="none"
    >Scanning, indexing, and description 
    sponsored by the Illinois State Library and
    the University of Illinois at Urbana-Champaign 
    Library. Geo-referencing sponsored and 
    performed by the Geographic Modeling Systems 
    Laboratory, University of Illinois at 
    Urbana-Champaign.</dcvalue>
  <dcvalue element="contributor" 
    qualifier="author"
    >United States. 
    Agricultural Adjustment Agency.</dcvalue>
  <dcvalue element="contributor" 
    qualifier="author"
    >Aerial Photographs</dcvalue>

There appears to be an error here (“Aerial Photographs” is apparently not the corporate or individual name of an author of this photograph).

Can this be avoided?

How did this happen?

Dubin explains:

An obvious interpretation is that this is simple tag abuse or human error, but the history of this description reveals it to be an example of a more general and complicated problem. This is the latest in a series of descriptions each derived from an earlier version:
A paper description accompanied the original photograph, which had been taken in 1938.

In 1998 the photograph was scanned for inclusion in an image database made available on the web [grainger99]. A metadata record for the photograph was entered into a relational database. The fields for that database were derived from the FGDC [Federal Geographic Data Committee] Content Standard for Digital Geospatial Metadata [fgdc98].
In May of 2005 an OAI 2.0 metadata record was derived from that database entry, via a mapping from the database fields into Dublin Core.
Several months later the OAI record was transformed via XSLT [XSL Transformations] into a form suitable for ingestion into a DSpace installation.
When the record was exported from DSpace, additonal DC [Dublin Core] metadata statements had been automatically added.

Verifying emulation, migration, normalization

How do you know when an emulator is working correctly?

How do you know when a migration program has produced a correct result? How do you detect errors in migration?

Methods of verification and quality control

Naked human eyeballs.

Automated processes:

validation
supravalidation
false-color proofs

Semi-automated human intervention (padded cells).

The 1, 10, 100 practice.

Formal methods of verification?

Identify the meaning of old record.
Identify the meaning of new record.
Compare.
- Same? ⇒ OK.
- Different? ⇒ NOT OK.

Identifying the meaning of markup

For each construct, define skeleton sentences in (English or) symbolic logic.

Skeleton sentences can contain blanks.

For each blank, a deictic expression says how to fill it in.
For each instance of the construct, generate an instance sentence from each skeleton sentence.
Optionally generate further inferences.

Identifying the meaning of markup

A simple example

From a formalization of the OAI-PMH vocabulary:

oai2:OAI-PMH: (∃ q : OAI-request) (∃ r : OAI-response) (∃ s : OAI-server) (∃ t : moment)
( q = (℩ q : OAI-request)(models({ ./oai:request }, q))
    ∧ s = (℩ s : OAI-server)(uri_server({ string(./oai:request) }, s))
    ∧ t = (℩ t : moment)(xsd_lv(xsd:dateTime, { string(./oai:responseDate) }, t))
    ∧ (∀ x)(uri_server(x, s) ⇒ x = { string(./oai:request) })
    ∧ { . } = r
    ∧ served_response(q,s,t,r))
oai2:error: (∃ q : OAI-request) (models({ preceding-sibling::oai:request }, q)
∧ invalid(q)
∧ request_error(q, { string(@code) })
∧ ({ string(.) } ≠ "" ⇒ error_nldesc(q, { string(.) }) ) ) )
oai2:GetRecord: (∃ q : OAI-request) (∃ s : oai-server) (∃ d : string) (∃ i : oai-item) (∃ p : string)
(q = (℩ q : OAI-request)(req_resp(q,{ ancestor::oai:OAI-PMH }))
    s = (℩ s : oai-server)(resp_server({ ancestor::oai:OAI-PMH }, s))
    d = (℩ d : string)(request_identifier(q, d))
    i = (℩ i : oai-item)(item_id(i, d))
    p = (℩ p : string(p)(request_metadataPrefix(q, p))
∧ request_verb(q, "GetRecord")
∧ errorfree(q)
∧ isin_repository_item(s, i)
∧ hasformat_repository_item_format(s, i, p) )

If you can't prevent degeneration, detect it

Before: ...

After: ... (∃ p)((person(p) ∨ organization(p) ∧ name(p,“Aerial photographs”) ...

Is there a reason these are different?

The library is a growing organism

The library is a growing organism.

-Ranganathan, Law 5

Is this true w.r.t. digital materials? Illlustrate.