Bifocal data
A problem and some solutions
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
16 February 2018
http://blackmesatech.com/2018/02/London/
 
Overview
- Bifocal data
  - a problem (or a class of problems)
  - some examples
 
- Approaches to dealing with bifocal data
  - duplication (uncontrolled redundancy)
  - constrained duplication (controlled redundancy)
  - single-source
  - dominant/recessive transformations
 
- Some relevant TEI constructs
 
Bifocal data
- a problem (or a class of problems)
- some examples
 
Bifocality (the problem) 1
Three kinds of objects to digitize:
- texts
- artefacts
- texts describing artefacts
  where both text and artefact are of interest
 
Bifocality (the problem) 2
More abstractly, the problem of bifocal data arises when:
- We have an object 1 
- ... representing or describing an object 2
- ... and both are of scholarly interest.
E.g. the objects described by Sloane's catalog, and the catalog itself.
 
Early dictionaries
Early print dictionaries:
- rich source of linguistic information
- complex information encoding
- cultural artefacts in own right
- but as a lexical database ... very very inconsistent
(Actually, this is true of any print dictionary.)
 
Print dictionary and lexical db
Johnson accents oathbreaking on syllable 2:

We want to
- preserve his accentuation but also show pronunciation in IPA (ˈəʊθˌbreɪkɪŋ)
- record that “n. ſ.” = noun
- record that “Shakesp.” = Shakespeare
 
Documents
Documents (whether incunabula, manuscripts, modern printed books,
or ...):
- witness a text
- document language usage
- are cultural artefacts in own right
- structure of book ≠ structure of text ~ structure of sentence(s)
- evidentiary value tied up with physical
  structure and nature
 
A Central Asian manuscript
Note:  deletion + addition; marginalia; verse structure ≠ sentence structure.

 
A simpler example
Toy illustration for conflicting structures.  Two utterances, two sentences:
  
  Peter: Hey Paul!
         Would you give me
  Paul:  the hammer?
    
 
Levels of description
Linguists may care about
- orthography in a document
- conventional orthography
- phonemic form
- morphological features, structure
- lexical items
- syntactic structure
- ...
 
Multiple linguistic levels (example)
A sentence with orthographic, phonemic, morphologically segmented
representations, parts of speech, morpheme glosses, and sentence
gloss.

 
Multiple linguistic levels (cont'd)
Closeup of the segmented representation.

 
Textual criticism
A manuscript tradition witnesses a text.
We may be interested in:
- individual manuscripts
- interrelation of witnesses (dependencies)
- text of archetype
- text of author's original
- ...
These are all conceptually distinct.
 
Historical analysis
Various documents provide evidence about a trial.
- What happened?
  - Who was charged with what?  Who prosecuted?  Who was the judge?  What was the outcome?
- Why do we think that that is what happened?
  - What evidence identifies defendant? charge? prosecutor? date? outcome?
- What is the correct text of each source?
  - What documents witness the text? ...
These are all conceptually distinct.  But we need all of
them. (Cf. Manfred Thaller.)
 
Analysis
Are these actually all the same problem?
A set of overlapping phenomena:
  
- multiple structures with shared content (overlap)
- multiple structures with some shared and some unshared content
- multiple notations / representations of the ‘same’ thing
- same / different / overlapping content
- same / different ordering
 
Approaches
- selection
- duplication of documents (uncontrolled redundancy)
- constrained duplication (controlled redundancy)
  - single organization with redundancy at leaves
  - multiple organizations with shared substructures
 
- single-source
  - view materialization (one ‘master’ view, others derived)
- dominant/recessive transformations
 
N.B. Not completely systematic.
 
Selection
Choose a view; let the rest go.  Summarize facts, link to sources.
(Hypertext is your friend.)
  

   
Duplication (uncontrolled redundancy)
Just make as many digital resources as you want.
- print dictionary and lexical database
- one digitization for the text, another for the document
- orthographic and phonemic (phonetic, ...) transcription
- one digitization for each witness
 
Problems with uncontrolled redundancy
- Updates are costly.
    So?  Who cares?
- Errors need multiple correction.
- Inconsistency creeps in.
- Searching across documents is hard.
  - Find all words containing /æ/ not spelled “a”.
  - Find all verses where “riter” in Ms. A opposes “recke” in C.
  - Find all verses where A and C both have “riter”.
  - ...
 
 
Shared global structures, local redundancy
Where
- global structures are shared
- fine-grained content (and structure?) differs
it's easy to place alternatives next to each other.
Local redundancy.
 
Local redundancy in dictionaries (1)
A straight encoding of Johnson's entry:
  
	<entry>
	  <form>
	    <orth>oathbrea′king</orth>
	  </form>
	    
	  <gramGrp>
	    <pos>n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author>Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>
   
Local redundancy in dictionaries (2)
Adding pronunciation, marking as noun, identifying Shakespeare:
  
	<entry>
	  <form>
	    <orth>oathbrea′king</orth>
	    <pron value="ˈəʊθˌbreɪkɪŋ"/>
	  </form>
	  <gramGrp>
	    <pos norm="noun">n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>
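
A minimal sketch of how both foci can be read out of one locally redundant entry: the printed forms stay in element content, the normalized forms ride along in attributes. The entry is abridged from the slide; the function name `lexical_record` and the dictionary keys are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the enriched entry (etymology and quotation omitted).
ENTRY = """
<entry>
  <form>
    <orth>oathbrea′king</orth>
    <pron value="ˈəʊθˌbreɪkɪŋ"/>
  </form>
  <gramGrp>
    <pos norm="noun">n. s.</pos>
  </gramGrp>
  <sense>
    <def>Perjury; the violation of an oath.</def>
    <cit type="example">
      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
    </cit>
  </sense>
</entry>
""".strip()


def lexical_record(entry):
    """Return printed and normalized values side by side (hypothetical keys)."""
    pos = entry.find("gramGrp/pos")
    author = entry.find(".//bibl/author")
    return {
        "headword_as_printed": entry.findtext("form/orth"),
        "pronunciation": entry.find("form/pron").get("value"),
        "pos_as_printed": pos.text,      # what Johnson printed
        "pos": pos.get("norm"),          # what a lexical database wants
        "cited_as_printed": author.text,
        "cited": author.get("key"),
    }


record = lexical_record(ET.fromstring(ENTRY))
```

The document view and the database view are both recoverable here because the alternatives sit next to each other at the leaves; nothing about the shared entry structure had to be duplicated.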
   
Local redundancy in dictionaries (3)
If we use numeric character references, ... 
  
	<entry>
	  <form>
	    <orth>oathbrea′king</orth>
	    <pron value="&#x2C8;&#x259;&#x28A;&#x3B8;&#x2CC;bre&#x26A;k&#x26A;&#x14B;"/>
	  </form>
	  <gramGrp>
	    <pos norm="noun">n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>
   
Local redundancy in tiered linguistic annotation
Fragment of a Uyghur sentence:
  
      <s lang="uig" who="Mejit Axun" ref="1">
        <w>
          <ow>
            <m>
              <orth>Aɫti</orth>
              <seg>aɫtʰi</seg>
              <pos>NU</pos>
              <ilg>six</ilg>
            </m>
          </ow>
          <ow>
            <m>
              <seg>ʃɛːr</seg>
              <pos>N</pos>
              <ilg>city</ilg>
            </m>
          </ow>
        </w>
        <w>
          <m>
            <seg>dɛ</seg>
            <pos>Vt</pos>
            <ilg>say</ilg>
          </m>
          <m>
            <seg>gɛn</seg>
            <pos>PRTC.RZR</pos>
            <ilg>PRTC.RZR</ilg>
          </m>
        </w>
        <w>
          <m>
            <seg>ʃu̇r</seg>
            <pos>N</pos>
            <ilg>city</ilg>
          </m>
          <m>
            <seg>lɛr</seg>
            <pos>PL</pos>
            <ilg>PL</ilg>
          </m>
        </w>
        <w>
          <m>
            <seg>ʃu</seg>
            <pos>PN.DEM</pos>
            <ilg>this</ilg>
          </m>
        </w>
        ...
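
A small sketch of reading the tiers back out of such an encoding. It uses two complete words copied from the fragment above, wrapped in a bare `<s>` so that the excerpt is well-formed; the helper name `tier` and the hyphen used to join morphemes are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Two <w> elements copied from the fragment, wrapped in a bare <s>.
FRAGMENT = """
<s>
  <w>
    <m><seg>dɛ</seg><pos>Vt</pos><ilg>say</ilg></m>
    <m><seg>gɛn</seg><pos>PRTC.RZR</pos><ilg>PRTC.RZR</ilg></m>
  </w>
  <w>
    <m><seg>ʃɛːr</seg><pos>N</pos><ilg>city</ilg></m>
    <m><seg>lɛr</seg><pos>PL</pos><ilg>PL</ilg></m>
  </w>
</s>
""".strip()


def tier(sentence, name):
    """Return one tier (seg, pos or ilg) as a list with one entry per word."""
    return [
        "-".join(m.findtext(name, default="") for m in w.iter("m"))
        for w in sentence.iter("w")
    ]


sentence = ET.fromstring(FRAGMENT)
segmented = tier(sentence, "seg")
glosses = tier(sentence, "ilg")
```

Each tier is a projection of the same word and morpheme skeleton, which is what makes this a case of shared global structure with redundancy only at the leaves.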
   
Multiple organizations with shared substructures
Standard form of ‘overlap problem’.
Some well understood approaches:
  
- CONCUR
- Trojan Horse markup
- Standoff markup
 
Peter and Paul — two structures
 
CONCUR
Mark multiple structures, each labeled.
  
<!DOCTYPE div SYSTEM "tei/dtd/teispok2.dtd">
<!DOCTYPE text SYSTEM "tei/dtd/teiana2.dtd">
<(div)div type="dialog" org="uniform">
  <(text)text>
     <(div)u who="Peter">
       <(text)s>Hey Paul!</(text)s>
       <(text)s>Would you give me
     </(div)u>
     <(div)u who="Paul">
       the hammer?</(text)s>
     </(div)u>
  </(text)text>
</(div)div> 
 
CONCUR (2)
Plus:  
- easy to understand
- easy to process one tree at a time
- easy to validate (DTDs, rabbit/duck grammars)
Minus:  
- not widely implemented (but easy to implement in XML)
- desired editor behavior unclear*
- what data structure to use?
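
The "one tree at a time" point can be illustrated with a toy projection: keep the tags labeled for one document type, drop the rest, and parse what remains as ordinary XML. This is only a sketch, not an SGML parser; the regular expression assumes the `<(doctype)name ...>` tag shape used in the Peter and Paul example and no `>` inside attribute values. The names `project` and `TAG` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The CONCUR example from the slide, without the DOCTYPE declarations.
CONCUR = """<(div)div type="dialog" org="uniform">
  <(text)text>
     <(div)u who="Peter">
       <(text)s>Hey Paul!</(text)s>
       <(text)s>Would you give me
     </(div)u>
     <(div)u who="Paul">
       the hammer?</(text)s>
     </(div)u>
  </(text)text>
</(div)div>"""

# Matches start and end tags carrying a (doctype) label.
TAG = re.compile(r"<(/?)\((\w+)\)([^>]*)>")


def project(concur, view):
    """Keep only the tags labeled with `view`, stripping the label itself."""
    def keep_or_drop(match):
        slash, doctype, rest = match.groups()
        return f"<{slash}{rest}>" if doctype == view else ""
    return TAG.sub(keep_or_drop, concur).strip()


utterances = ET.fromstring(project(CONCUR, "div"))
sentences = ET.fromstring(project(CONCUR, "text"))
speakers = [u.get("who") for u in utterances.iter("u")]
sentence_texts = [" ".join(s.text.split()) for s in sentences.iter("s")]
```

Each projection is a well-formed tree on its own; the open questions listed under "Minus" concern working with both trees at once.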
 
Trojan Horse markup
When an element doesn't fit, tag its start and end with empty
  elements (here the second <s>):
  
<div type="dialog" org="uniform">
  <u who="Peter">
    <s>Hey Paul!</s>
    <s sID="s2"/>Would you give me
  </u>
  <u who="Paul">
    the hammer?<s eID="s2"/>
  </u>
</div> 
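
A sketch of how a consumer might recover the content of the "virtual" second sentence from its two milestones. The walk is a plain document-order traversal; the function name `milestone_text` is hypothetical, and whitespace is normalized for readability.

```python
import xml.etree.ElementTree as ET

DIALOG = """<div type="dialog" org="uniform">
  <u who="Peter">
    <s>Hey Paul!</s>
    <s sID="s2"/>Would you give me
  </u>
  <u who="Paul">
    the hammer?<s eID="s2"/>
  </u>
</div>"""


def milestone_text(root, ident):
    """Collect the text between the sID and eID milestones named `ident`."""
    pieces = []
    inside = False

    def walk(el):
        nonlocal inside
        if el.get("sID") == ident:
            inside = True
        elif el.get("eID") == ident:
            inside = False
        if inside and el.text:
            pieces.append(el.text)
        for child in el:
            walk(child)
            # Text following a child belongs to the current position.
            if inside and child.tail:
                pieces.append(child.tail)

    walk(root)
    return " ".join("".join(pieces).split())


second_sentence = milestone_text(ET.fromstring(DIALOG), "s2")
```

The utterance structure is available to any XML tool directly; the sentence that crosses the utterance boundary needs this kind of extra processing.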
 
Standoff markup
Point in to the data from elsewhere.
  
<documentset>
  <div type="dialog" org="uniform">    
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>      
      <s xml:id="s1">
	<mirror target="#t1"/>
      </s>
      <s xml:id="s2">
	<mirror target="#t2 #t3"/>
      </s>
    </text>
  </div>
</documentset> 
 
Standoff markup (bis)
The TEI <join> element is designed specifically for this.
  
<documentset>
  <div type="dialog" org="uniform">    
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>
      <join xml:id="s1" result="s"
	    target="#t1"/>
      <join xml:id="s2" result="s"
	    target="#t2 #t3"/>
    </text>
  </div>
</documentset>
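
A sketch of resolving the pointers: each `<join>` is turned into the virtual element named by its `result` attribute, with content gathered from the targets in the order listed. Only same-document `#id` references are handled, which is an assumption of this illustration; the name `resolve_joins` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """<documentset>
  <div type="dialog" org="uniform">
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>
      <join xml:id="s1" result="s" target="#t1"/>
      <join xml:id="s2" result="s" target="#t2 #t3"/>
    </text>
  </div>
</documentset>"""


def resolve_joins(root):
    """Map each join's id to (result element name, concatenated target text)."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    resolved = {}
    for join in root.iter("join"):
        targets = [by_id[ref.lstrip("#")] for ref in join.get("target").split()]
        text = " ".join("".join(t.itertext()).strip() for t in targets)
        resolved[join.get(XML_ID)] = (join.get("result"), text)
    return resolved


virtual = resolve_joins(ET.fromstring(DOC))
```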
 
Single-source (view materialization)
Define a ‘master view’ with all information.
(Often complex!)
Generate task-specific views by filtering information out
  
- at need (dynamically)
- in advance (statically)
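
A sketch of filtering a task-specific view out of a master view, reusing the enriched Johnson entry (abridged) as the master. Which elements and attributes count as editorial additions is an assumption made here for illustration, as are the names `printed_view`, `EDITORIAL_ELEMENTS` and `EDITORIAL_ATTRIBUTES`.

```python
import copy
import xml.etree.ElementTree as ET

MASTER = ET.fromstring("""<entry>
  <form>
    <orth>oathbrea′king</orth>
    <pron value="ˈəʊθˌbreɪkɪŋ"/>
  </form>
  <gramGrp>
    <pos norm="noun">n. s.</pos>
  </gramGrp>
  <sense>
    <def>Perjury; the violation of an oath.</def>
    <cit type="example">
      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
    </cit>
  </sense>
</entry>""")

# Assumed for this sketch: what was added by the editor, not printed by Johnson.
EDITORIAL_ELEMENTS = {"pron"}
EDITORIAL_ATTRIBUTES = {"norm", "key"}


def printed_view(master):
    """Derive a view holding only what the printed dictionary contains."""
    view = copy.deepcopy(master)          # the master itself is never modified
    for parent in list(view.iter()):
        for child in list(parent):
            if child.tag in EDITORIAL_ELEMENTS:
                parent.remove(child)
        for name in EDITORIAL_ATTRIBUTES:
            parent.attrib.pop(name, None)
    return view


view = printed_view(MASTER)
```

The function can be run at need or in advance. It only goes one way: nothing in it carries a change made in the derived view back to the master.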
 
Challenge of view materialization
If the task-specific view is read-only, fine.
If it generates / edits information ...
... you have to get the new info into the master view.
 
Dominant/recessive transformations
Define two Trojan-Horse representations:
  
- View A dominant (with B as milestones)
- View B dominant (with A as milestones)
  Transform back and forth.
 
Dominant/recessive transformations (example)
Utterance dominant:
  
<div xml:id="d1" type="dialog" org="uniform">
  <text sID="t1"/>
  <u xml:id="u1" who="Peter">
    <s sID="s1"/>Hey Paul!<s eID="s1"/>
    <s sID="s2"/>Would you give me
  </u>
  <u xml:id="u2" who="Paul">
    the hammer?<s eID="s2"/>
  </u>
  <text eID="t1"/>
</div>
This is essentially CONCUR using Trojan-Horse markup for non-primary element structures.
 
Sentence dominant
Turning the tables:
  
<text xml:id="t1">
  <div sID="d1" type="dialog" org="uniform"/>
  <u sID="u1" who="Peter"/>
  <s xml:id="s1">Hey Paul!</s>
  <s xml:id="s2">Would you give me
  <u eID="u1"/>
  <u sID="u2" who="Paul"/>
    the hammer?</s>
  <u eID="u2"/>
  <div eID="d1"/>
</text>
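
One way to check that a transformation between the two views has lost nothing is to reduce each view to the same set of labeled ranges, treating real elements and milestone pairs alike. The sketch below does only that comparison, not the transformation itself. Offsets are counted in non-whitespace characters so that indentation does not matter, and the second sentence is given an `xml:id` because it is a real element in the sentence-dominant view. The function name `ranges` is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

UTTERANCE_DOMINANT = """<div xml:id="d1" type="dialog" org="uniform">
  <text sID="t1"/>
  <u xml:id="u1" who="Peter">
    <s sID="s1"/>Hey Paul!<s eID="s1"/>
    <s sID="s2"/>Would you give me
  </u>
  <u xml:id="u2" who="Paul">
    the hammer?<s eID="s2"/>
  </u>
  <text eID="t1"/>
</div>"""

SENTENCE_DOMINANT = """<text xml:id="t1">
  <div sID="d1" type="dialog" org="uniform"/>
  <u sID="u1" who="Peter"/>
  <s xml:id="s1">Hey Paul!</s>
  <s xml:id="s2">Would you give me
  <u eID="u1"/>
  <u sID="u2" who="Paul"/>
    the hammer?</s>
  <u eID="u2"/>
  <div eID="d1"/>
</text>"""


def ranges(root):
    """Map each id to (tag, start, end), for elements and milestone pairs."""
    found = {}
    pos = 0

    def count(text):
        return len("".join((text or "").split()))

    def walk(el):
        nonlocal pos
        if "sID" in el.attrib:                 # start milestone
            found[el.get("sID")] = [el.tag, pos, None]
        elif "eID" in el.attrib:               # end milestone
            found[el.get("eID")][2] = pos
        else:                                  # ordinary element
            start = pos
            pos += count(el.text)
            for child in el:
                walk(child)
                pos += count(child.tail)
            if XML_ID in el.attrib:
                found[el.get(XML_ID)] = [el.tag, start, pos]

    walk(root)
    return {key: tuple(value) for key, value in found.items()}


view_a = ranges(ET.fromstring(UTTERANCE_DOMINANT))
view_b = ranges(ET.fromstring(SENTENCE_DOMINANT))
```

If `view_a` and `view_b` are equal, the two documents mark the same structures over the same text and differ only in which structure is carried by real elements.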
 
Challenges of dominant/recessive representations
- Validation
- Schema development
 
Some TEI constructs
- milestone elements
- extension: Trojan-horse markup
- analysis and correspondence links
- stand-off annotation
  - feature structures
- other annotation
- <join> element