Sending Messages to the Future
Preservation Aspects of Data Representation
C. M. Sperberg-McQueen, Black Mesa Technologies
Summer XML 2009, 28 July 2009
Overview
- Why care?
- Preservation as one-way messages
- A threat model
- Preserving meaning
- Conclusions
Why care?
What do we want from digital resources?
- Ease of use
- Reusability
- Sharing
- Secondary analysis
- Historical record
- Permanence
Preservation as one-way messages
The future is another country
The future is a foreign country; they do things differently there.
- After L. P. Hartley
Interoperation across temporal boundaries ≃ interoperation across geographic boundaries.
All the usual rules of device independence and application independence apply.
Operating with minimal feedback
- Recipient is the future
- No “Message garbled; please re-transmit” replies
- So ... we need to get it right.
A threat model
Failures of the expressive function
First failure point: sender ...
- fails to project
- fails to say what they mean to say (poor modeling)
- loses or deletes the data
Failures of the conative function
Second point of failure: recipient ...
- does not listen
- listens but does not care
- cares but does not understand
N.B. We do not know who the recipient is, in detail.
Using the message to ask the recipient to do something is ... unreliable, and possibly dangerous.
Stick to description of facts.
Phatic failures
Failures of the channel:
- Bad media
- Obsolete hardware
- Obsolete software
- Disappearing repositories?
- Invisible repositories?
Emulation, migration, normalization
- emulation: keep the original data / software, emulate (i.e. fake) the environment?
- migration: when the original object is in danger of obsolescence, translate it into a newer equivalent.
- normalization: on ingestion, translate the object into a stable non-proprietary form. Cf. ICPSR's (past) use of Osiris; JSTOR; NLM's PubMed Central.
Which approach, when?
Do digital objects suffer generational degeneration?
Second-generation photocopies are less clear.
Ditto second-generation photographs, copies of drawings, manuscript transcriptions of texts.
Are digital objects free of this degeneration?
Yes, digital objects suffer generational degeneration
Straight copies are usually exempt.
Every format conversion involves a better or worse match; most are potentially lossy.
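The contrast between a straight copy and a format conversion can be illustrated with a small sketch. The sample string and the choice of ASCII as the target format are made up for the illustration; they are assumptions, not taken from any real migration.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Made-up sample data, used only for this illustration.
original = "Zürich café, 1938".encode("utf-8")

# A straight copy is bit-for-bit identical, so the digests match.
straight_copy = bytes(original)

# A hypothetical conversion to a narrower character set and back again.
# Characters with no ASCII equivalent are replaced by "?", and the
# original characters cannot be recovered afterwards.
converted = original.decode("utf-8").encode("ascii", errors="replace")
round_trip = converted.decode("ascii").encode("utf-8")

copy_is_identical = digest(straight_copy) == digest(original)
round_trip_is_identical = digest(round_trip) == digest(original)

print(copy_is_identical)        # True
print(round_trip_is_identical)  # False
print(round_trip.decode("utf-8"))
```

A checksum only detects that something changed. It cannot say whether the change matters, which is where the semantic checks later in the talk come in.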
Metalinguistic failures
Failures in the code:
- Character set problems
- Data format problems
- proprietary formats
- ad-hoc (homebrew) formats
- unconventional use of public formats
- semantic failures
Safeguarding semantics
Semantic failures
A rich source of problems.
- lightning bugs
- failure to grasp full meaning and import
- failure to understand what the fields / elements / attributes / records mean
- accumulating errors of translation
When automation turns bad
<dublin_core>
<dcvalue element="contributor" qualifier="none"
>Scanning, indexing, and description
sponsored by the Illinois State Library and
the University of Illinois at Urbana-Champaign
Library. Geo-referencing sponsored and
performed by the Geographic Modeling Systems
Laboratory, University of Illinois at
Urbana-Champaign.</dcvalue>
<dcvalue element="contributor"
qualifier="author"
>United States.
Agricultural Adjustment Agency.</dcvalue>
<dcvalue element="contributor"
qualifier="author"
>Aerial Photographs</dcvalue>
There appears to be an error here (“Aerial Photographs” is apparently not the corporate or individual name of an author of this photograph).
Can this be avoided?
How did this happen?
Dubin explains:
An obvious interpretation is that this is simple tag abuse or human error, but the history of this description reveals it to be an example of a more general and complicated problem. This is the latest in a series of descriptions each derived from an earlier version:
- A paper description accompanied the original photograph, which had been taken in 1938.
- In 1998 the photograph was scanned for inclusion in an image database made available on the web [grainger99]. A metadata record for the photograph was entered into a relational database. The fields for that database were derived from the FGDC [Federal Geographic Data Committee] Content Standard for Digital Geospatial Metadata [fgdc98].
- In May of 2005 an OAI 2.0 metadata record was derived from that database entry, via a mapping from the database fields into Dublin Core.
- Several months later the OAI record was transformed via XSLT [XSL Transformations] into a form suitable for ingestion into a DSpace installation.
- When the record was exported from DSpace, additional DC [Dublin Core] metadata statements had been automatically added.
Verifying emulation, migration, normalization
How do you know when an emulator is working correctly?
How do you know when a migration program has produced a correct result?
How do you detect errors in migration?
Methods of verification and quality control
Naked human eyeballs.
Automated processes:
- validation
- supravalidation
- false-color proofs
Semi-automated human intervention (padded cells).
The 1, 10, 100 practice.
Formal methods of verification?
- Identify the meaning of the old record.
- Identify the meaning of the new record.
- Compare.
- Same? ⇒ OK.
- Different? ⇒ NOT OK.
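The comparison step can be sketched as follows, assuming the meaning of each record has already been written out as a set of sentences. The function name `compare_meanings` and the two sample records are hypothetical. Plain string equality is used here as a crude stand-in for logical equivalence, which a real verifier would need to establish by inference.

```python
def compare_meanings(old_sentences, new_sentences):
    """Report sentences lost or gained between two versions of a record.

    Each argument is a collection of sentences (plain strings) stating
    what the record means. String equality stands in for logical
    equivalence; that is an assumption of this sketch.
    """
    old, new = set(old_sentences), set(new_sentences)
    return {"lost": sorted(old - new), "gained": sorted(new - old)}


# Hypothetical sentence sets for the record before and after a migration.
before = [
    '(∃ p)((person(p) ∨ organization(p)) '
    '∧ name(p, "United States. Agricultural Adjustment Agency."))',
]
after = before + [
    '(∃ p)((person(p) ∨ organization(p)) ∧ name(p, "Aerial Photographs"))',
]

diff = compare_meanings(before, after)
verdict = "OK" if not diff["lost"] and not diff["gained"] else "NOT OK"
print(verdict)
for sentence in diff["gained"]:
    print("gained:", sentence)
```

A "NOT OK" verdict is not proof of an error. It is a prompt for a human to ask whether there is a reason the two versions differ.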
Identifying the meaning of markup
For each construct, define skeleton sentences in (English or) symbolic logic.
Skeleton sentences can contain blanks.
For each blank, a deictic expression says how to fill it in.
For each instance of the construct, generate an instance sentence from each skeleton sentence.
Optionally generate further inferences.
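A minimal sketch of this procedure, applied to two of the `dcvalue` elements from the Dublin Core record shown earlier. The `SKELETONS` table, the `{name}` blank syntax, and the use of Python functions as deictic expressions are assumptions made for the illustration; a real formalization would more likely use XPath expressions, as in the OAI-PMH example.

```python
import xml.etree.ElementTree as ET

# Two dcvalue elements from the record shown earlier (whitespace tidied).
RECORD = """<dublin_core>
  <dcvalue element="contributor" qualifier="author">United States.
    Agricultural Adjustment Agency.</dcvalue>
  <dcvalue element="contributor" qualifier="author">Aerial Photographs</dcvalue>
</dublin_core>"""


def normalized_text(node):
    """Deictic expression: the element's text with whitespace collapsed."""
    return " ".join((node.text or "").split())


# Hypothetical table: construct -> list of (skeleton sentence, blanks).
# Each blank maps to a deictic expression that says how to fill it in.
SKELETONS = {
    ("contributor", "author"): [
        ('(∃ p)((person(p) ∨ organization(p)) ∧ name(p, "{name}"))',
         {"name": normalized_text}),
    ],
}


def instance_sentences(xml_text):
    """Generate one instance sentence per skeleton per matching element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for node in root.iter("dcvalue"):
        construct = (node.get("element"), node.get("qualifier"))
        for skeleton, blanks in SKELETONS.get(construct, []):
            values = {blank: fill(node) for blank, fill in blanks.items()}
            sentences.append(skeleton.format(**values))
    return sentences


for sentence in instance_sentences(RECORD):
    print(sentence)
```

Written out this way, the second sentence claims that some person or organization is named "Aerial Photographs", which makes the error in the record easy to spot.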
Identifying the meaning of markup
A simple example
From a formalization of the OAI-PMH vocabulary:
- oai2:OAI-PMH
  (∃ q : OAI-request)
  (∃ r : OAI-response)
  (∃ s : OAI-server)
  (∃ t : moment)
  ( q = (℩ q : OAI-request)(models({ ./oai:request }, q))
  ∧ s = (℩ s : OAI-server)(uri_server({ string(./oai:request) }, s))
  ∧ t = (℩ t : moment)(xsd_lv(xsd:dateTime, { string(./oai:responseDate) }, t))
  ∧ (∀ x)(uri_server(x, s) ⇒ x = { string(./oai:request) })
  ∧ { . } = r
  ∧ served_response(q,s,t,r))
- oai2:error
  (∃ q : OAI-request)
  (models({ preceding-sibling::oai:request }, q)
  ∧ invalid(q)
  ∧ request_error(q, { string(@code) })
  ∧ ({ string(.) } ≠ "" ⇒ error_nldesc(q, { string(.) }) )
  )
- oai2:GetRecord
  (∃ q : OAI-request)
  (∃ s : oai-server)
  (∃ d : string)
  (∃ i : oai-item)
  (∃ p : string)
  ( q = (℩ q : OAI-request)(req_resp(q,{ ancestor::oai:OAI-PMH }))
  ∧ s = (℩ s : oai-server)(resp_server({ ancestor::oai:OAI-PMH }, s))
  ∧ d = (℩ d : string)(request_identifier(q, d))
  ∧ i = (℩ i : oai-item)(item_id(i, d))
  ∧ p = (℩ p : string)(request_metadataPrefix(q, p))
  ∧ request_verb(q, "GetRecord")
  ∧ errorfree(q)
  ∧ isin_repository_item(s, i)
  ∧ hasformat_repository_item_format(s, i, p)
  )
If you can't prevent degeneration, detect it
Before: ... blah blah blah ...
After: ... blah blah
(∃ p)((person(p) ∨ organization(p)) ∧ name(p,“Aerial Photographs”))
blah ...
Is there a reason these are different?
Morals
- Plan for deposit, not just dissemination. It does take time. And money.
- Say what you mean; mean what you say.
- Make sure the recipient can find the message.
- Make sure the recipient can know what the message is about.
- Use long-lived media.
- Arrange for institutional support of preservation tasks.
- Document character set and numeric formats.
- Document higher-level formats.
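Why character sets and numeric formats need documenting can be shown with a small sketch: the same bytes yield different values under different assumptions. The byte strings and the candidate encodings below are made up for the illustration.

```python
import struct

# The same four bytes read under two different character sets.
raw_text = b"Caf\xe9"
as_latin1 = raw_text.decode("latin-1")  # 0xE9 is e-acute in ISO 8859-1
as_cp437 = raw_text.decode("cp437")     # 0xE9 is Greek capital theta in CP437

# The same four bytes read as an unsigned 32-bit integer in two byte orders.
raw_number = b"\x00\x00\x07\x92"
big_endian = struct.unpack(">I", raw_number)[0]
little_endian = struct.unpack("<I", raw_number)[0]

print(as_latin1, as_cp437)
print(big_endian, little_endian)
```

Nothing in the bytes themselves says which reading is intended. Only documentation that travels with the data can tell the recipient.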
Preserving semantics
- Think about what you wish to say.
- Design the vocabulary (the format) carefully, to make instances easy to understand.
- Document the vocabulary and your usage.
- Avoid (undocumented) tag abuse.
- Provide and document ancillary materials.
- Validate early and often.
- Verify early and often.