Bifocal data
A problem and some solutions
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
16 February 2018
http://blackmesatech.com/2018/02/London/
Overview
- Bifocal data
- a problem (or a class of problems)
- some examples
- Approaches to dealing with bifocal data
- duplication (uncontrolled redundancy)
- constrained duplication (controlled redundancy)
- single-source
- dominant/recessive transformations
- Some relevant TEI constructs
Bifocal data
- a problem (or a class of problems)
- some examples
Bifocality (the problem) 1
Three kinds of objects to digitize:
- texts
- artefacts
- texts describing artefacts
where both text and artefact are of interest
Bifocality (the problem) 2
More abstractly, the problem of bifocal data arises when:
- We have an object 1
- ... representing or describing an object 2
- ... and both are of scholarly interest.
E.g. objects described by Sloane's catalog, and catalog itself.
Early dictionaries
Early print dictionaries:
- rich source of linguistic information
- complex information encoding
- cultural artefacts in own right
- but as a lexical database ... very very inconsistent
(Actually, this is true of any print dictionary.)
Print dictionary and lexical db
Johnson accents oathbreaking on syllable 2 (oathbrea′king).
We want to
- preserve his accentuation but also show pronunciation in IPA
(ˈəʊθˌbreɪkɪŋ)
- record that “n. ſ.” = noun
- record that “Shakesp.” = Shakespeare
Documents
Documents (whether incunabula, manuscripts, modern printed books, or ...):
- witness a text
- document language usage
- are cultural artefacts in own right
- structure of book ≠ structure of text ~ structure of sentence(s)
- evidentiary value tied up with physical structure and nature
A Central Asian manuscript
Note: deletion + addition; marginalia; verse structure ≠ sentence structure.
A simpler example
Toy illustration for conflicting structures. Two utterances, two sentences:
Peter: Hey Paul!
Would you give me
Paul: the hammer?
Levels of description
Linguists may care about
- orthography in a document
- conventional orthography
- phonemic form
- morphological features, structure
- lexical items
- syntactic structure
- ...
Multiple linguistic levels (example)
A sentence with orthographic, phonemic, morphologically segmented
representations, parts of speech, morpheme glosses, and sentence
gloss.
Multiple linguistic levels (cont'd)
Closeup of the segmented representation.
Textual criticism
A manuscript tradition witnesses a text.
We may be interested in:
- individual manuscripts
- interrelation of witnesses (dependencies)
- text of archetype
- text of author's original
- ...
These are all conceptually distinct.
Historical analysis
Various documents provide evidence about a trial.
What happened?
Who was charged with what? Who prosecuted? Who was the judge? What was the outcome?
Why do we think that that is what happened?
What evidence identifies defendant? charge? prosecutor? date? outcome?
What is the correct text of each source?
What documents witness the text? ...
These are all conceptually distinct. But we need all of
them. (Cf. Manfred Thaller.)
Analysis
Are these actually all the same problem?
A set of overlapping phenomena:
- multiple structures with shared content (overlap)
- multiple structures with some shared and some unshared content
- multiple notations / representations of ‘same’ thing
- same / different / overlapping content
- same / different ordering
Approaches
- selection
- duplication of documents (uncontrolled redundancy)
- constrained duplication (controlled redundancy)
- single organization with redundancy at leaves
- multiple organizations with shared substructures
- single-source
- view materialization (one ‘master’ view, others derived)
- dominant/recessive transformations
N.B. Not completely systematic.
Selection
Choose a view; let the rest go. Summarize facts, link to
sources.
(Hypertext is your friend.)
Duplication (uncontrolled redundancy)
Just make as many digital resources as you want.
- print dictionary and lexical database
- one digitization for the text,
another for the document
- orthographic and phonemic (phonetic, ...) transcription
- one digitization for each witness
Problems with uncontrolled redundancy
- Updates are costly.
So? Who cares?
- Errors need multiple correction.
- Inconsistency creeps in.
- Searching across documents hard.
- Find all words containing /æ/ not spelled “a”.
- Find all verses where “riter” in Ms. A opposes “recke” in C.
- Find all verses where A and C both have “riter”.
- ...
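Why such searches are hard can be sketched in a few lines. This is a minimal illustration, assuming the separate digitizations at least share verse numbers as an alignment key. The verse texts are invented toy data and the function names are hypothetical; only the words “riter” and “recke” come from the queries above.

```python
# Invented toy data: two independently maintained transcriptions,
# keyed by verse number (the shared key is itself an assumption).
witness_a = {1: "der riter kam", 2: "ein riter sprach", 3: "der recke reit"}
witness_c = {1: "der recke kam", 2: "ein riter sprach", 3: "der recke reit"}


def verses_with(witness, word):
    """Return the verse numbers whose text contains `word` as a token."""
    return {n for n, text in witness.items() if word in text.split()}


def opposed(a, c, word_a, word_c):
    """Verses where witness `a` has `word_a` and witness `c` has `word_c`."""
    return sorted(verses_with(a, word_a) & verses_with(c, word_c))
```

Usage: `opposed(witness_a, witness_c, "riter", "recke")` answers the second query and `opposed(witness_a, witness_c, "riter", "riter")` the third. Without a shared key the intersection step has nothing to join on, which is where the difficulty lies.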
Shared global structures, local redundancy
Where
- global structures are shared
- fine-grained content (and structure?) differs
it's easy to place alternatives next to each other.
Local redundancy.
Local redundancy in dictionaries (1)
A straight encoding of Johnson's entry:
<entry>
<form>
<orth>oathbrea′king</orth>
</form>
<gramGrp>
<pos>n. s.</pos>
</gramGrp>
<etym>
<mentioned>oath</mentioned> and <mentioned>break</mentioned>.
</etym>
<sense>
<def>Perjury; the violation of an oath.</def>
<cit type="example">
<q>
<l rend="indent">His <oRef/> he mended thus,</l>
<l>By now forswearing that he is forsworn.</l>
</q>
<bibl><author>Shakesp.</author></bibl>
</cit>
</sense>
</entry>
Local redundancy in dictionaries (2)
Adding pronunciation, marking as noun, identifying Shakespeare:
<entry>
<form>
<orth>oathbrea′king</orth>
<pron value="ˈəʊθˌbreɪkɪŋ"/>
</form>
<gramGrp>
<pos norm="noun">n. s.</pos>
</gramGrp>
<etym>
<mentioned>oath</mentioned> and <mentioned>break</mentioned>.
</etym>
<sense>
<def>Perjury; the violation of an oath.</def>
<cit type="example">
<q>
<l rend="indent">His <oRef/> he mended thus,</l>
<l>By now forswearing that he is forsworn.</l>
</q>
<bibl><author key="Shakespeare">Shakesp.</author></bibl>
</cit>
</sense>
</entry>
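With the alternatives side by side, one pass over the entry can serve both foci. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `lexical_record` and the dictionary keys are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

ENTRY = """<entry>
<form>
<orth>oathbrea′king</orth>
<pron value="ˈəʊθˌbreɪkɪŋ"/>
</form>
<gramGrp>
<pos norm="noun">n. s.</pos>
</gramGrp>
<sense>
<cit type="example">
<bibl><author key="Shakespeare">Shakesp.</author></bibl>
</cit>
</sense>
</entry>"""


def lexical_record(entry_xml):
    """Pair what the page prints with the normalized value stored next to it."""
    entry = ET.fromstring(entry_xml)
    pos = entry.find("gramGrp/pos")
    author = entry.find(".//author")
    return {
        "orth_printed": entry.findtext("form/orth"),
        "pron_ipa": entry.find("form/pron").get("value"),
        "pos_printed": pos.text,        # element content: the document
        "pos_norm": pos.get("norm"),    # attribute: the lexical database
        "author_printed": author.text,
        "author_key": author.get("key"),
    }
```

The design point is that the printed form lives in element content and the normalization in an attribute on the same element, so neither view needs a second file.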
Local redundancy in dictionaries (3)
If we use numeric character references, ...
<entry>
<form>
<orth>oathbrea′king</orth>
<pron value="&#x2C8;&#x259;&#x28A;&#x3B8;&#x2CC;bre&#x26A;k&#x26A;&#x14B;"/>
</form>
<gramGrp>
<pos norm="noun">n. s.</pos>
</gramGrp>
<etym>
<mentioned>oath</mentioned> and <mentioned>break</mentioned>.
</etym>
<sense>
<def>Perjury; the violation of an oath.</def>
<cit type="example">
<q>
<l rend="indent">His <oRef/> he mended thus,</l>
<l>By now forswearing that he is forsworn.</l>
</q>
<bibl><author key="Shakespeare">Shakesp.</author></bibl>
</cit>
</sense>
</entry>
Local redundancy in tiered linguistic annotation
Fragment of a Uyghur sentence:
<s lang="uig" who="Mejit Axun" ref="1">
<w>
<ow>
<m>
<orth>Aɫti</orth>
<seg>aɫtʰi</seg>
<pos>NU</pos>
<ilg>six</ilg>
</m>
</ow>
<ow>
<m>
<seg>ʃɛːr</seg>
<pos>N</pos>
<ilg>city</ilg>
</m>
</ow>
</w>
<w>
<m>
<seg>dɛ</seg>
<pos>Vt</pos>
<ilg>say</ilg>
</m>
<m>
<seg>gɛn</seg>
<pos>PRTC.RZR</pos>
<ilg>PRTC.RZR</ilg>
</m>
</w>
<w>
<m>
<seg>ʃɛːr</seg>
<pos>N</pos>
<ilg>city</ilg>
</m>
<m>
<seg>lɛr</seg>
<pos>PL</pos>
<ilg>PL</ilg>
</m>
</w>
<w>
<m>
<seg>ʃu</seg>
<pos>PN.DEM</pos>
<ilg>this</ilg>
</m>
</w>
...
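Because each morpheme carries all of its tiers locally, any single tier can be read off as a line of interlinear text. A minimal sketch, assuming the element names shown above; the sample repeats two words of the fragment and closes the `<s>` element so that it parses, and the function name `tier` is hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<s>
<w>
<m><seg>dɛ</seg><pos>Vt</pos><ilg>say</ilg></m>
<m><seg>gɛn</seg><pos>PRTC.RZR</pos><ilg>PRTC.RZR</ilg></m>
</w>
<w>
<m><seg>ʃɛːr</seg><pos>N</pos><ilg>city</ilg></m>
<m><seg>lɛr</seg><pos>PL</pos><ilg>PL</ilg></m>
</w>
</s>"""


def tier(sentence, name):
    """One interlinear line: morphemes joined by '-', words by spaces."""
    return " ".join(
        "-".join(m.findtext(name, default="") for m in w.iter("m"))
        for w in sentence.iter("w")
    )


sentence = ET.fromstring(FRAGMENT)
segmented = tier(sentence, "seg")
glossed = tier(sentence, "ilg")
```

Here `segmented` is the segmented phonemic line and `glossed` the morpheme gloss line for the same two words.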
Multiple organizations with shared substructures
Standard form of ‘overlap problem’.
Some well understood approaches:
- CONCUR
- Trojan Horse markup
- Standoff markup
Peter and Paul — two structures
CONCUR
Mark multiple structures, each labeled.
<!DOCTYPE div SYSTEM "tei/dtd/teispok2.dtd">
<!DOCTYPE text SYSTEM "tei/dtd/teiana2.dtd">
<(div)div type="dialog" org="uniform">
<(text)text>
<(div)u who="Peter">
<(text)s>Hey Paul!</(text)s>
<(text)s>Would you give me
</(div)u>
<(div)u who="Paul">
the hammer?</(text)s>
</(div)u>
</(text)text>
</(div)div>
CONCUR (2)
Plus:
- easy to understand
- easy to process one tree at a time
- easy to validate (DTDs, rabbit/duck grammars)
Minus:
- not widely implemented (but easy to implement in XML)
- desired editor behavior unclear
- what data structure to use?
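Processing one tree at a time can be sketched as a projection: keep the tags labeled for one view, drop the tags of every other view. This is a rough illustration that treats the CONCUR document as a plain string; the regular expression is an assumption about the label syntax used in the example (the DOCTYPE lines are omitted), and `project` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

CONCUR = """<(div)div type="dialog" org="uniform">
<(text)text>
<(div)u who="Peter">
<(text)s>Hey Paul!</(text)s>
<(text)s>Would you give me
</(div)u>
<(div)u who="Paul">
the hammer?</(text)s>
</(div)u>
</(text)text>
</(div)div>"""

# A labeled tag: optional slash, "(label)", then the rest of the tag.
TAG = re.compile(r"<(/?)\((\w+)\)([^>]*)>")


def project(concur, view):
    """Keep tags labeled `view` (without the label); delete all other tags."""
    def keep_or_drop(match):
        slash, label, rest = match.groups()
        return f"<{slash}{rest}>" if label == view else ""
    return TAG.sub(keep_or_drop, concur)


def contents(xml_text, name):
    """Whitespace-normalized text of each `name` element in one projection."""
    root = ET.fromstring(xml_text)
    return [" ".join("".join(e.itertext()).split()) for e in root.iter(name)]
```

Each projection is an ordinary well-formed document, so `contents(project(CONCUR, "text"), "s")` lists the sentences and `contents(project(CONCUR, "div"), "u")` the utterances.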
Trojan Horse markup
When an element doesn't fit, tag its start and end with empty
elements (here the second <s>):
<div type="dialog" org="uniform">
<u who="Peter">
<s>Hey Paul!</s>
<s sID="s2"/>Would you give me
</u>
<u who="Paul">
the hammer?<s eID="s2"/>
</u>
</div>
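Recovering the content of the milestone-delimited sentence takes a document-order walk that collects character data between the two markers. A minimal sketch with the standard library; `walk` and `milestone_text` are hypothetical helper names, and whitespace is normalized for readability.

```python
import xml.etree.ElementTree as ET

DIALOG = """<div type="dialog" org="uniform">
<u who="Peter">
<s>Hey Paul!</s>
<s sID="s2"/>Would you give me
</u>
<u who="Paul">
the hammer?<s eID="s2"/>
</u>
</div>"""


def walk(el):
    """Yield ('start', id) / ('end', id) for milestones and ('text', str) for text."""
    if el.get("sID"):
        yield ("start", el.get("sID"))
    elif el.get("eID"):
        yield ("end", el.get("eID"))
    if el.text:
        yield ("text", el.text)
    for child in el:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def milestone_text(root, ident):
    """Text between the sID and eID milestones that carry `ident`."""
    parts, inside = [], False
    for kind, value in walk(root):
        if kind == "start" and value == ident:
            inside = True
        elif kind == "end" and value == ident:
            inside = False
        elif kind == "text" and inside:
            parts.append(value)
    return " ".join("".join(parts).split())
```

The virtual element crosses the `<u>` boundary, which is exactly what an ordinary `<s>` could not do.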
Standoff markup
Point in to the data from elsewhere.
<documentset>
<div type="dialog" org="uniform">
<u who="Peter">
<seg xml:id="t1">Hey Paul!</seg>
<seg xml:id="t2">Would you give me</seg>
</u>
<u who="Paul">
<seg xml:id="t3">the hammer?</seg>
</u>
</div>
<div type="standoff s">
<text>
<s xml:id="s1">
<mirror target="#t1"/>
</s>
<s xml:id="s2">
<mirror target="#t2 #t3"/>
</s>
</text>
</div>
</documentset>
Standoff markup (bis)
The TEI <join> element is designed specifically for this.
<documentset>
<div type="dialog" org="uniform">
<u who="Peter">
<seg xml:id="t1">Hey Paul!</seg>
<seg xml:id="t2">Would you give me</seg>
</u>
<u who="Paul">
<seg xml:id="t3">the hammer?</seg>
</u>
</div>
<div type="standoff s">
<text>
<join xml:id="s1" result="s"
target="#t1"/>
<join xml:id="s2" result="s"
target="#t2 #t3"/>
</text>
</div>
</documentset>
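Resolving the pointers is mechanical. The following sketch builds an index of `xml:id` values and concatenates the targets of each `<join>`; the function name `resolve_joins` is hypothetical, and joining the pieces with a single space is a simplifying assumption rather than anything the TEI prescribes.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """<documentset>
<div type="dialog" org="uniform">
<u who="Peter">
<seg xml:id="t1">Hey Paul!</seg>
<seg xml:id="t2">Would you give me</seg>
</u>
<u who="Paul">
<seg xml:id="t3">the hammer?</seg>
</u>
</div>
<div type="standoff s">
<text>
<join xml:id="s1" result="s" target="#t1"/>
<join xml:id="s2" result="s" target="#t2 #t3"/>
</text>
</div>
</documentset>"""


def resolve_joins(root):
    """Map each join's xml:id to the concatenated text of its targets."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    virtual = {}
    for join in root.iter("join"):
        targets = [ref.lstrip("#") for ref in join.get("target").split()]
        virtual[join.get(XML_ID)] = " ".join(by_id[t].text for t in targets)
    return virtual


sentences = resolve_joins(ET.fromstring(DOC))
```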
Single-source (view materialization)
Define a ‘master view’ with all information.
(Often complex!)
Generate task-specific views by filtering information out
- at need (dynamically)
- in advance (statically)
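Filtering can be sketched on the dictionary entry: treat the enriched encoding as the master view and strip the editorial additions to recover the plain transcription. This is a minimal illustration, assuming that `<pron>`, `norm` and `key` are the only additions; the name `document_view` is hypothetical and the sample entry is abbreviated.

```python
import xml.etree.ElementTree as ET

MASTER = (
    '<entry><form><orth>oathbrea′king</orth>'
    '<pron value="ˈəʊθˌbreɪkɪŋ"/></form>'
    '<gramGrp><pos norm="noun">n. s.</pos></gramGrp>'
    '<bibl><author key="Shakespeare">Shakesp.</author></bibl></entry>'
)


def document_view(master_xml):
    """Drop editorial additions, leaving only what the page prints."""
    root = ET.fromstring(master_xml)
    for el in list(root.iter()):        # snapshot first: the tree is edited below
        for child in list(el):
            if child.tag == "pron":     # pronunciation was added by the editor
                el.remove(child)
        el.attrib.pop("norm", None)     # normalized part of speech
        el.attrib.pop("key", None)      # resolved author name
    return ET.tostring(root, encoding="unicode")


plain = document_view(MASTER)
```

Run in advance, this produces a static view; called on request, a dynamic one. Either way the view is read-only: nothing here carries edits back into the master.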
Challenge of view materialization
If the task-specific view is read-only, fine.
If it generates / edits information ...
... you have to get the new info into the master view.
Dominant/recessive transformations
Define two Trojan-Horse representations:
- View A dominant (with B as milestones)
- View B dominant (with A as milestones)
Transform back and forth.
Dominant/recessive transformations (example)
Utterance dominant:
<div xml:id="d1" type="dialog" org="uniform">
<text sID="t1"/>
<u xml:id="u1" who="Peter">
<s sID="s1"/>Hey Paul!<s eID="s1"/>
<s sID="s2"/>Would you give me
</u>
<u xml:id="u2" who="Paul">
the hammer?<s eID="s2"/>
</u>
<text eID="t1"/>
</div>
This is essentially CONCUR using Trojan-Horse markup for
non-primary element structures.
Sentence dominant
Turning the tables:
<text xml:id="t1">
<div sID="d1" type="dialog" org="uniform"/>
<u sID="u1" who="Peter"/>
<s xml:id="s1">Hey Paul!</s>
<s xml:id="s2">Would you give me
<u eID="u1"/>
<u sID="u2" who="Paul"/>
the hammer?</s>
<u eID="u2"/>
<div eID="d1"/>
</text>
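The back-and-forth transformation can be sketched as two steps: flatten a view into start, end and text events, then re-serialize with a different set of element names as real elements. This is a rough illustration, not a definitive implementation. It assumes every real element carries `xml:id`, that milestones carry only `sID` or `eID` plus ordinary attributes, and that no text lies outside the outermost dominant element; `events` and `render` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

UTTERANCE_DOMINANT = """<div xml:id="d1" type="dialog" org="uniform">
<text sID="t1"/>
<u xml:id="u1" who="Peter">
<s sID="s1"/>Hey Paul!<s eID="s1"/>
<s sID="s2"/>Would you give me
</u>
<u xml:id="u2" who="Paul">
the hammer?<s eID="s2"/>
</u>
<text eID="t1"/>
</div>"""


def events(el):
    """Flatten a view: ('start', name, id, attrs), ('end', name, id), ('text', s)."""
    attrs = dict(el.attrib)
    if "sID" in attrs:                       # recessive start milestone
        yield ("start", el.tag, attrs.pop("sID"), attrs)
    elif "eID" in attrs:                     # recessive end milestone
        yield ("end", el.tag, attrs.pop("eID"))
    else:                                    # dominant (real) element
        ident = attrs.pop(XML_ID)
        yield ("start", el.tag, ident, attrs)
        if el.text:
            yield ("text", el.text)
        for child in el:
            yield from events(child)
            if child.tail:
                yield ("text", child.tail)
        yield ("end", el.tag, ident)


def render(evs, dominant):
    """Serialize events: names in `dominant` as elements, the rest as milestones."""
    evs = list(evs)
    # The outermost dominant element must enclose everything else, so move
    # its start tag to the front and its end tag to the back.
    first = next(i for i, e in enumerate(evs) if e[0] == "start" and e[1] in dominant)
    evs.insert(0, evs.pop(first))
    last = max(i for i, e in enumerate(evs) if e[0] == "end" and e[1] in dominant)
    evs.append(evs.pop(last))
    out = []
    for e in evs:
        if e[0] == "text":
            out.append(escape(e[1]))
        elif e[0] == "start":
            _, name, ident, attrs = e
            extra = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            if name in dominant:
                out.append(f'<{name} xml:id="{ident}"{extra}>')
            else:
                out.append(f'<{name} sID="{ident}"{extra}/>')
        else:
            _, name, ident = e
            out.append(f"</{name}>" if name in dominant else f'<{name} eID="{ident}"/>')
    return "".join(out)


sentence_view = render(events(ET.fromstring(UTTERANCE_DOMINANT)), {"text", "s"})
utterance_view = render(events(ET.fromstring(sentence_view)), {"div", "u"})
```

The round trip preserves elements, identifiers, attributes and text, but not the exact whitespace or the relative order of tags that sit at the same point in the text.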
Challenges of dominant/recessive representations
- Validation
- Schema development
Some TEI constructs
- milestone elements
- extension: Trojan-horse markup
- analysis and correspondence links
- stand-off annotation
- feature structures
- other annotation
- <join> element