Bifocal data

A problem and some solutions

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

16 February 2018

http://blackmesatech.com/2018/02/London/


Overview

Bifocal data

Bifocality (the problem) 1

Three kinds of objects to digitize:
  • texts
  • artefacts
  • texts describing artefacts where both text and artefact are of interest

Bifocality (the problem) 2

More abstractly, the problem of bifocal data arises when:
  • We have an object 1
  • ... representing or describing an object 2
  • ... and both are of scholarly interest.
E.g. the objects described by Sloane's catalog, and the catalog itself.

Early dictionaries

Early print dictionaries:
  • rich source of linguistic information
  • complex information encoding
  • cultural artefacts in their own right
  • but as a lexical database ... very very inconsistent
(Actually, this is true of any print dictionary.)

Print dictionary and lexical db

Johnson accents oathbreaking on syllable 2:

We want to
  • preserve his accentuation but also show pronunciation in IPA (ˈəʊθˌbreɪkɪŋ)
  • record that “n. ſ.” = noun
  • record that “Shakesp.” = Shakespeare

Documents

Documents (whether incunabula, manuscripts, modern printed books, or ...):
  • witness a text
  • document language usage
  • are cultural artefacts in their own right
  • structure of book ≠ structure of text ~ structure of sentence(s)
  • evidentiary value tied up with physical structure and nature

A Central Asian manuscript

Note: deletion + addition; marginalia; verse structure ≠ sentence structure.

A simpler example

Toy illustration for conflicting structures. Two utterances, two sentences:

Peter: Hey Paul! Would you give me
Paul: the hammer?

Levels of description

Linguists may care about
  • orthography in a document
  • conventional orthography
  • phonemic form
  • morphological features, structure
  • lexical items
  • syntactic structure
  • ...

Multiple linguistic levels (example)

A sentence with orthographic, phonemic, morphologically segmented representations, parts of speech, morpheme glosses, and sentence gloss.

Multiple linguistic levels (cont'd)

Closeup of the segmented representation.

Textual criticism

A manuscript tradition witnesses a text. We may be interested in:
  • individual manuscripts
  • interrelation of witnesses (dependencies)
  • text of archetype
  • text of author's original
  • ...
These are all conceptually distinct.

Historical analysis

Various documents provide evidence about a trial.
  • What happened?
    Who was charged with what? Who prosecuted? Who was the judge? What was the outcome?
  • Why do we think that that is what happened?
    What evidence identifies defendant? charge? prosecutor? date? outcome?
  • What is the correct text of each source?
    What documents witness the text? ...
These are all conceptually distinct. But we need all of them. (Cf. Manfred Thaller.)

Analysis

Are these actually all the same problem?
A set of overlapping phenomena:
  • multiple structures with shared content (overlap)
  • multiple structures with some shared and some unshared content
  • multiple notations / representations of ‘same’ thing
  • same / different / overlapping content
  • same / different ordering

Approaches

N.B. Not completely systematic.

Selection

Choose a view; let the rest go. Summarize facts, link to sources. (Hypertext is your friend.)

Duplication (uncontrolled redundancy)

Just make as many digital resources as you want.

Problems with uncontrolled redundancy

Shared global structures, local redundancy

Where
  • global structures are shared
  • fine-grained content (and structure?) differs
it's easy to place alternatives next to each other.
Local redundancy.

Local redundancy in dictionaries (1)

A straight encoding of Johnson's entry:
	<entry>
	  <form>
	    <orth>oathbrea&#x2032;king</orth>
	  </form>
	    
	  <gramGrp>
	    <pos>n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author>Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>

Local redundancy in dictionaries (2)

Adding pronunciation, marking as noun, identifying Shakespeare:
	<entry>
	  <form>
	    <orth>oathbrea′king</orth>
	    <pron value="ˈəʊθˌbreɪkɪŋ"/>
	  </form>
	  <gramGrp>
	    <pos norm="noun">n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>

Local redundancy in dictionaries (3)

If we use numeric character references, ...
	<entry>
	  <form>
	    <orth>oathbrea&#x2032;king</orth>
	    <pron value="&#x02C8;&#x0259;&#x028A;&#x03B8;&#x02CC;bre&#x026A;k&#x026A;&#x014B;"/>
	  </form>
	  <gramGrp>
	    <pos norm="noun">n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>
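
A minimal sketch of why local redundancy works here: one entry answers both kinds of question. It assumes Python's standard xml.etree.ElementTree and an abbreviated copy of the enriched entry; the function name two_views is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the enriched entry (etymology, definition, quotation omitted).
ENTRY = """<entry>
  <form>
    <orth>oathbrea′king</orth>
    <pron value="ˈəʊθˌbreɪkɪŋ"/>
  </form>
  <gramGrp>
    <pos norm="noun">n. s.</pos>
  </gramGrp>
  <sense>
    <cit type="example">
      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
    </cit>
  </sense>
</entry>"""

def two_views(entry_xml):
    """Return (as_printed, normalized) dictionaries for one entry."""
    entry = ET.fromstring(entry_xml)
    pos = entry.find("gramGrp/pos")
    author = entry.find(".//author")
    # Element content records what the page says ...
    as_printed = {
        "headword": entry.findtext("form/orth"),
        "pos": pos.text,
        "author": author.text,
    }
    # ... and attributes record the editor's normalization.
    normalized = {
        "pronunciation": entry.find("form/pron").get("value"),
        "pos": pos.get("norm"),
        "author": author.get("key"),
    }
    return as_printed, normalized

printed, normalized = two_views(ENTRY)
```

The document focus reads element content; the lexical-database focus reads the attributes. Neither view has to be stored separately.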

Local redundancy in tiered linguistic annotation

Fragment of a Uyghur sentence:
      <s lang="uig" who="Mejit Axun" ref="1">
        <w>
          <ow>
            <m>
              <orth>Aɫti</orth>
              <seg>aɫtʰi</seg>
              <pos>NU</pos>
              <ilg>six</ilg>
            </m>
          </ow>
          <ow>
            <m>
              <seg>ʃɛːr</seg>
              <pos>N</pos>
              <ilg>city</ilg>
            </m>
          </ow>
        </w>
        <w>
          <m>
            <seg>dɛ</seg>
            <pos>Vt</pos>
            <ilg>say</ilg>
          </m>
          <m>
            <seg>gɛn</seg>
            <pos>PRTC.RZR</pos>
            <ilg>PRTC.RZR</ilg>
          </m>
        </w>
        <w>
          <m>
            <seg>ʃɛːr</seg>
            <pos>N</pos>
            <ilg>city</ilg>
          </m>
          <m>
            <seg>lɛr</seg>
            <pos>PL</pos>
            <ilg>PL</ilg>
          </m>
        </w>
        <w>
          <m>
            <seg>ʃu</seg>
            <pos>PN.DEM</pos>
            <ilg>this</ilg>
          </m>
        </w>
            ...
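
A sketch of reading individual tiers back out of such an encoding, assuming Python's standard xml.etree.ElementTree. The sample is two words of the fragment above, closed off so that it parses; the function name tier and the hyphen joiner are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Two words from the fragment, with the <s> element closed so the sample parses.
SAMPLE = """<s>
  <w>
    <m><seg>dɛ</seg><pos>Vt</pos><ilg>say</ilg></m>
    <m><seg>gɛn</seg><pos>PRTC.RZR</pos><ilg>PRTC.RZR</ilg></m>
  </w>
  <w>
    <m><seg>ʃɛːr</seg><pos>N</pos><ilg>city</ilg></m>
    <m><seg>lɛr</seg><pos>PL</pos><ilg>PL</ilg></m>
  </w>
</s>"""

def tier(sentence, name, joiner="-"):
    """Collect one tier: morphemes joined within a word, one item per word."""
    return [
        joiner.join(m.findtext(name, default="") for m in w.iter("m"))
        for w in sentence.iter("w")
    ]

sentence = ET.fromstring(SAMPLE)
segments = tier(sentence, "seg")
parts_of_speech = tier(sentence, "pos")
glosses = tier(sentence, "ilg")
```

Each tier is a projection of the same word and morpheme skeleton, which is what makes the alternatives easy to keep side by side.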
  

Multiple organizations with shared substructures

Standard form of ‘overlap problem’.
Some well understood approaches:
  • CONCUR
  • Trojan Horse markup
  • Standoff markup

Peter and Paul — two structures

CONCUR

Mark multiple structures, each labeled.
<!DOCTYPE div SYSTEM "tei/dtd/teispok2.dtd">
<!DOCTYPE text SYSTEM "tei/dtd/teiana2.dtd">
<(div)div type="dialog" org="uniform">
  <(text)text>
     <(div)u who="Peter">
       <(text)s>Hey Paul!</(text)s>
       <(text)s>Would you give me
     </(div)u>
     <(div)u who="Paul">
       the hammer?</(text)s>
     </(div)u>
  </(text)text>
</(div)div>
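
A toy sketch of "process one tree at a time": keep the tags labeled for one document type and drop the others. It assumes Python's standard re and xml.etree.ElementTree and uses a regular expression rather than an SGML parser, so it only handles tags shaped like the ones in this example (no ">" inside attribute values). The function name extract_view is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The CONCUR example, without its DOCTYPE declarations.
CONCUR = """<(div)div type="dialog" org="uniform">
  <(text)text>
     <(div)u who="Peter">
       <(text)s>Hey Paul!</(text)s>
       <(text)s>Would you give me
     </(div)u>
     <(div)u who="Paul">
       the hammer?</(text)s>
     </(div)u>
  </(text)text>
</(div)div>"""

# Matches <(label)name ...> and </(label)name>.
TAG = re.compile(r"<(/?)\((\w+)\)([^>]*)>")

def extract_view(concur, doctype):
    """Keep tags labeled with `doctype`; delete tags carrying any other label."""
    def keep_or_drop(match):
        slash, label, rest = match.groups()
        return f"<{slash}{rest}>" if label == doctype else ""
    return TAG.sub(keep_or_drop, concur).strip()

def normalized_text(element):
    """All text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())

div_view = ET.fromstring(extract_view(CONCUR, "div"))
text_view = ET.fromstring(extract_view(CONCUR, "text"))
utterances = [normalized_text(u) for u in div_view.iter("u")]
sentences = [normalized_text(s) for s in text_view.iter("s")]
```

Each extracted view is ordinary well-formed XML, so any XML tool can take over from there.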

CONCUR (2)

Plus:
  • easy to understand
  • easy to process one tree at a time
  • easy to validate (DTDs, rabbit/duck grammars)
Minus:
  • not widely implemented (but easy to implement in XML)
  • desired editor behavior unclear
  • what data structure to use?

Trojan Horse markup

When an element doesn't fit, tag its start and end with empty elements (here, the second <s>):
<div type="dialog" org="uniform">
  <u who="Peter">
    <s>Hey Paul!</s>
    <s sID="s2"/>Would you give me
  </u>
  <u who="Paul">
    the hammer?<s eID="s2"/>
  </u>
</div>
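
A sketch of recovering the text of the milestone-delimited sentence, assuming Python's standard xml.etree.ElementTree. The function name milestone_text is hypothetical; the walk simply switches collection on at the sID marker and off at the matching eID marker.

```python
import xml.etree.ElementTree as ET

DIALOG = """<div type="dialog" org="uniform">
  <u who="Peter">
    <s>Hey Paul!</s>
    <s sID="s2"/>Would you give me
  </u>
  <u who="Paul">
    the hammer?<s eID="s2"/>
  </u>
</div>"""

def milestone_text(root, name, ident):
    """Text between <name sID=ident/> and <name eID=ident/>, in document order."""
    chunks = []
    inside = False

    def walk(element):
        nonlocal inside
        if element.tag == name and element.get("sID") == ident:
            inside = True
        elif element.tag == name and element.get("eID") == ident:
            inside = False
        if inside and element.text:
            chunks.append(element.text)
        for child in element:
            walk(child)
            # Text after a child belongs to the parent and follows the child.
            if inside and child.tail:
                chunks.append(child.tail)

    walk(root)
    return " ".join("".join(chunks).split())

second_sentence = milestone_text(ET.fromstring(DIALOG), "s", "s2")
```

The sentence crosses the boundary between the two <u> elements, and the walk crosses it too because it follows document order rather than the element tree.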

Standoff markup

Point into the data from elsewhere.
<documentset>
  <div type="dialog" org="uniform">    
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>      
      <s xml:id="s1">
	<mirror target="#t1"/>
      </s>
      <s xml:id="s2">
	<mirror target="#t2 #t3"/>
      </s>
    </text>
  </div>
</documentset>
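
A sketch of resolving the pointers, assuming Python's standard xml.etree.ElementTree. The <mirror> element and its target attribute are taken from the example as shown; the function name resolve_sentences is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """<documentset>
  <div type="dialog" org="uniform">
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>
      <s xml:id="s1"><mirror target="#t1"/></s>
      <s xml:id="s2"><mirror target="#t2 #t3"/></s>
    </text>
  </div>
</documentset>"""

def resolve_sentences(root):
    """Map each standoff <s> to the text of the segments it points at."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    sentences = {}
    for s in root.iter("s"):
        parts = []
        for mirror in s.iter("mirror"):
            for pointer in mirror.get("target").split():
                parts.append(by_id[pointer.lstrip("#")].text)
        sentences[s.get(XML_ID)] = " ".join(parts)
    return sentences

sentences = resolve_sentences(ET.fromstring(DOC))
```

The utterance structure stays untouched; the sentence structure exists only as pointers into it.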

Standoff markup (bis)

The TEI <join> element is designed specifically for this.
<documentset>
  <div type="dialog" org="uniform">    
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>
      <join xml:id="s1" result="s"
	    target="#t1"/>
      <join xml:id="s2" result="s"
	    target="#t2 #t3"/>
    </text>
  </div>
</documentset>

Single-source (view materialization)

Define a ‘master view’ with all information.
(Often complex!)
Generate task-specific views by filtering information out
  • at need (dynamically)
  • in advance (statically)
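
A sketch of generating a task-specific view by filtering, using an abbreviated dictionary entry as the master view. It assumes Python's standard xml.etree.ElementTree; the function name as_printed_view and the two sets naming editorial additions are assumptions for illustration.

```python
import copy
import xml.etree.ElementTree as ET

MASTER = """<entry>
  <form>
    <orth>oathbrea′king</orth>
    <pron value="ˈəʊθˌbreɪkɪŋ"/>
  </form>
  <gramGrp>
    <pos norm="noun">n. s.</pos>
  </gramGrp>
  <sense>
    <cit type="example">
      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
    </cit>
  </sense>
</entry>"""

# Assumption: these names mark editorial additions in the master view.
EDITORIAL_ELEMENTS = {"pron"}
EDITORIAL_ATTRIBUTES = {"norm", "key"}

def as_printed_view(master):
    """Return a filtered copy of the master; the master itself is untouched."""
    view = copy.deepcopy(master)
    # Materialize the traversal first so removals cannot disturb it.
    for parent in list(view.iter()):
        for child in list(parent):
            if child.tag in EDITORIAL_ELEMENTS:
                parent.remove(child)
        for name in EDITORIAL_ATTRIBUTES:
            parent.attrib.pop(name, None)
    return view

master = ET.fromstring(MASTER)
view = as_printed_view(master)
```

Calling the filter on request is the dynamic case; running it once and saving the result is the static case.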

Challenge of view materialization

If the task-specific view is read-only, fine.
If it generates / edits information ...
... you have to get the new info into the master view.

Dominant/recessive transformations

Define two Trojan-Horse representations:
  • View A dominant (with B as milestones)
  • View B dominant (with A as milestones)
Transform back and forth.

Dominant/recessive transformations (example)

Utterance dominant:
<div xml:id="d1" type="dialog" org="uniform">
  <text sID="t1"/>
  <u xml:id="u1" who="Peter">
    <s sID="s1"/>Hey Paul!<s eID="s1"/>
    <s sID="s2"/>Would you give me
  </u>
  <u xml:id="u2" who="Paul">
    the hammer?<s eID="s2"/>
  </u>
  <text eID="t1"/>
</div>
This is essentially CONCUR using Trojan-Horse markup for non-primary element structures.

Sentence dominant

Turning the tables:
<text xml:id="t1">
  <div sID="d1" type="dialog" org="uniform"/>
  <u sID="u1" who="Peter"/>
  <s xml:id="s1">Hey Paul!</s>
  <s xml:id="s2">Would you give me
  <u eID="u1"/>
  <u sID="u2" who="Paul"/>
    the hammer?</s>
  <u eID="u2"/>
  <div eID="d1"/>
</text>
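
A sketch of one way to transform between the two representations, assuming Python's standard xml.etree.ElementTree and xml.sax.saxutils. It flattens the document into start, end, and text events, then writes the events back with a chosen set of element types as real elements and everything else as milestones. The names events, reroot, serialize, and flip are hypothetical. It assumes every real element carries an xml:id and that the dominant hierarchy has a single outermost element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

UTTERANCE_DOMINANT = """<div xml:id="d1" type="dialog" org="uniform">
  <text sID="t1"/>
  <u xml:id="u1" who="Peter">
    <s sID="s1"/>Hey Paul!<s eID="s1"/>
    <s sID="s2"/>Would you give me
  </u>
  <u xml:id="u2" who="Paul">
    the hammer?<s eID="s2"/>
  </u>
  <text eID="t1"/>
</div>"""

def events(el):
    """Flatten a tree into start/end/text events; milestones count as tags."""
    attrs = dict(el.attrib)
    if "sID" in attrs:
        yield ("start", el.tag, attrs.pop("sID"), attrs)
    elif "eID" in attrs:
        yield ("end", el.tag, attrs["eID"], {})
    else:
        ident = attrs.pop(XML_ID)  # assumption: every real element has xml:id
        yield ("start", el.tag, ident, attrs)
        if el.text:
            yield ("text", el.text)
        for child in el:
            yield from events(child)
            if child.tail:
                yield ("text", child.tail)
        yield ("end", el.tag, ident, {})

def reroot(evts, dominant):
    """Move the outermost dominant start and end to the very edges."""
    is_dominant = [e[0] != "text" and e[1] in dominant for e in evts]
    first = is_dominant.index(True)
    last = len(evts) - 1 - is_dominant[::-1].index(True)
    return ([evts[first]] + evts[:first] + evts[first + 1:last]
            + evts[last + 1:] + [evts[last]])

def serialize(evts, dominant):
    """Write dominant tags as real elements, all others as milestones."""
    out = []
    for evt in evts:
        if evt[0] == "text":
            out.append(escape(evt[1]))
            continue
        kind, tag, ident, attrs = evt
        real = tag in dominant
        if kind == "end":
            out.append(f"</{tag}>" if real else f"<{tag} eID={quoteattr(ident)}/>")
        else:
            pairs = [("xml:id" if real else "sID", ident), *attrs.items()]
            attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in pairs)
            out.append(f"<{tag}{attr_text}{'' if real else '/'}>")
    return "".join(out)

def flip(xml_text, dominant):
    """Re-express a Trojan-Horse document with `dominant` tags as elements."""
    evts = list(events(ET.fromstring(xml_text)))
    return serialize(reroot(evts, dominant), dominant)

sentence_dominant = flip(UTTERANCE_DOMINANT, {"text", "s"})
round_trip = flip(sentence_dominant, {"div", "u"})
```

The reroot step is needed because the recessive outermost element would otherwise leave milestones outside the new root. It is a shortcut that fits this example, not a general solution.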

Challenges of dominant/recessive representations

Some TEI constructs