Bifocal data

A problem and some solutions

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

16 February 2018

http://blackmesatech.com/2018/02/London/

Overview

Bifocal data
- a problem (or a class of problems)
- some examples
Approaches to dealing with bifocal data
- duplication (uncontrolled redundancy)
- constrained duplication (controlled redundancy)
- single-source
- dominant/recessive transformations
Some relevant TEI constructs

Bifocal data

a problem (or a class of problems)
some examples

Bifocality (the problem) 1

Three kinds of objects to digitize:

texts
artefacts
texts describing artefacts where both text and artefact are of interest

Bifocality (the problem) 2

More abstractly, the problem of bifocal data arises when:

We have an object 1
... representing or describing an object 2
... and both are of scholarly interest.

E.g. the objects described by Sloane's catalog, and the catalog itself.

Early dictionaries

Early print dictionaries:

rich source of linguistic information
complex information encoding
cultural artefacts in their own right
but as a lexical database ... very very inconsistent

(Actually, this is true of any print dictionary.)

Print dictionary and lexical db

Johnson accents oathbreaking on syllable 2:

We want to

preserve his accentuation but also show pronunciation in IPA (ˈəʊθˌbreɪkɪŋ)
record that “n. ſ.” = noun
record that “Shakesp.” = Shakespeare

Documents

Documents (whether incunabula, manuscripts, modern printed books, or ...):

witness a text
document language usage
are cultural artefacts in their own right
structure of book ≠ structure of text ~ structure of sentence(s)
evidentiary value tied up with physical structure and nature

A Central Asian manuscript

Note: deletion + addition; marginalia; verse structure ≠ sentence structure.

A simpler example

Toy illustration for conflicting structures. Two utterances, two sentences:

Peter: Hey Paul! Would you give me

Paul: the hammer?

Levels of description

Linguists may care about

orthography in a document
conventional orthography
phonemic form
morphological features, structure
lexical items
syntactic structure
...

Multiple linguistic levels (example)

A sentence with orthographic, phonemic, morphologically segmented representations, parts of speech, morpheme glosses, and sentence gloss.

Multiple linguistic levels (cont'd)

Closeup of the segmented representation.

Textual criticism

A manuscript tradition witnesses a text. We may be interested in:

individual manuscripts
interrelation of witnesses (dependencies)
text of archetype
text of author's original
...

These are all conceptually distinct.

Historical analysis

Various documents provide evidence about a trial.

What happened?

Who was charged with what? Who prosecuted? Who was the judge? What was the outcome?
Why do we think that that is what happened?

What evidence identifies defendant? charge? prosecutor? date? outcome?
What is the correct text of each source?

What documents witness the text? ...

These are all conceptually distinct. But we need all of them. (Cf. Manfred Thaller.)

Analysis

Are these actually all the same problem?

A set of overlapping phenomena:

multiple structures with shared content (overlap)
multiple structures with some shared and some unshared content
multiple notations / representations of ‘same’ thing
same / different / overlapping content
same / different ordering

Approaches

selection
duplication of documents (uncontrolled redundancy)
constrained duplication (controlled redundancy)
- single organization with redundancy at leaves
- multiple organizations with shared substructures
single-source
- view materialization (one ‘master’ view, others derived)
- dominant/recessive transformations

N.B. Not completely systematic.

Selection

Choose a view; let the rest go. Summarize facts, link to sources. (Hypertext is your friend.)

Duplication (uncontrolled redundancy)

Just make as many digital resources as you want.

print dictionary and lexical database
one digitization for the text, another for the document
orthographic and phonemic (phonetic, ...) transcription
one digitization for each witness

Problems with uncontrolled redundancy

Updates are costly. So? Who cares?
Errors need multiple correction.
Inconsistency creeps in.
Searching across documents is hard.
- Find all words containing /æ/ not spelled “a”.
- Find all verses where “riter” in Ms. A opposes “recke” in C.
- Find all verses where A and C both have “riter”.
- ...

Shared global structures, local redundancy

Where

global structures are shared
fine-grained content (and structure?) differs

it's easy to place alternatives next to each other.

Local redundancy.

Local redundancy in dictionaries (1)

A straight encoding of Johnson's entry:

	<entry>
	  <form>
	    <orth>oathbrea&#x2032;king</orth>
	  </form>
	    
	  <gramGrp>
	    <pos>n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author>Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>

Local redundancy in dictionaries (2)

Adding pronunciation, marking as noun, identifying Shakespeare:

	<entry>
	  <form>
	    <orth>oathbrea′king</orth>
	    <pron value="ˈəʊθˌbreɪkɪŋ"/>
	  </form>
	  <gramGrp>
	    <pos norm="noun">n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>

Local redundancy in dictionaries (3)

If we use numeric character references, ...

	<entry>
	  <form>
	    <orth>oathbrea&#x2032;king</orth>
	    <pron value="&#x02C8;&#x0259;&#x028A;&#x03B8;&#x02CC;bre&#x026A;k&#x026A;&#x014B;"/>
	  </form>
	  <gramGrp>
	    <pos norm="noun">n. s.</pos>
	  </gramGrp>
	  <etym>
	    <mentioned>oath</mentioned> and <mentioned>break</mentioned>.
	  </etym>
	  <sense>
	    <def>Perjury; the violation of an oath.</def>
	    <cit type="example">
	      <q>
		<l rend="indent">His <oRef/> he mended thus,</l>
		<l>By now forswearing that he is forsworn.</l>
	      </q>
	      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
	    </cit>
	  </sense>
	</entry>
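
A minimal sketch of how the two foci can be read back out of one locally redundant entry: element content carries what Johnson printed, attribute values carry the editorial normalization. The entry is an abridged copy of the one above; the function names `print_view` and `database_view` are hypothetical, and Python's standard `xml.etree.ElementTree` stands in for whatever XML toolkit a project actually uses.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the enriched entry; element and attribute names as above.
ENTRY = """<entry>
  <form>
    <orth>oathbrea&#x2032;king</orth>
    <pron value="&#x02C8;&#x0259;&#x028A;&#x03B8;&#x02CC;bre&#x026A;k&#x026A;&#x014B;"/>
  </form>
  <gramGrp>
    <pos norm="noun">n. s.</pos>
  </gramGrp>
  <sense>
    <def>Perjury; the violation of an oath.</def>
    <cit type="example">
      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
    </cit>
  </sense>
</entry>"""


def print_view(entry):
    """The dictionary as printed: element content only."""
    return {
        "headword": entry.findtext("form/orth"),
        "pos": entry.findtext("gramGrp/pos"),
        "author": entry.findtext("sense/cit/bibl/author"),
    }


def database_view(entry):
    """The lexical database: normalized values held in attributes."""
    return {
        "pron": entry.find("form/pron").get("value"),
        "pos": entry.find("gramGrp/pos").get("norm"),
        "author": entry.find("sense/cit/bibl/author").get("key"),
    }


entry = ET.fromstring(ENTRY)
print(print_view(entry))
print(database_view(entry))
```

Both views come from the same element, so a correction to the entry is made once.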

Local redundancy in tiered linguistic annotation

Fragment of a Uyghur sentence:

      <s lang="uig" who="Mejit Axun" ref="1">
        <w>
          <ow>
            <m>
              <orth>Aɫti</orth>
              <seg>aɫtʰi</seg>
              <pos>NU</pos>
              <ilg>six</ilg>
            </m>
          </ow>
          <ow>
            <m>
              <seg>ʃɛːr</seg>
              <pos>N</pos>
              <ilg>city</ilg>
            </m>
          </ow>
        </w>
        <w>
          <m>
            <seg>dɛ</seg>
            <pos>Vt</pos>
            <ilg>say</ilg>
          </m>
          <m>
            <seg>gɛn</seg>
            <pos>PRTC.RZR</pos>
            <ilg>PRTC.RZR</ilg>
          </m>
        </w>
        <w>
          <m>
            <seg>ʃɛːr</seg>
            <pos>N</pos>
            <ilg>city</ilg>
          </m>
          <m>
            <seg>lɛr</seg>
            <pos>PL</pos>
            <ilg>PL</ilg>
          </m>
        </w>
        <w>
          <m>
            <seg>ʃu</seg>
            <pos>PN.DEM</pos>
            <ilg>this</ilg>
          </m>
        </w>
        ...
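
A small sketch of why redundancy at the leaves is convenient: each tier of the interlinear display can be read off the same `<w>`/`<m>` skeleton. The input is a two-word excerpt of the fragment above, closed with `</s>` so that it parses; the function name `tier` and the hyphen between morphemes are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Two words excerpted from the fragment above; the closing </s> is added
# here only so that the excerpt is well-formed.
FRAGMENT = """<s>
  <w>
    <m><seg>dɛ</seg><pos>Vt</pos><ilg>say</ilg></m>
    <m><seg>gɛn</seg><pos>PRTC.RZR</pos><ilg>PRTC.RZR</ilg></m>
  </w>
  <w>
    <m><seg>ʃɛːr</seg><pos>N</pos><ilg>city</ilg></m>
    <m><seg>lɛr</seg><pos>PL</pos><ilg>PL</ilg></m>
  </w>
</s>"""


def tier(sentence, name):
    """Return one annotation tier: morphemes joined by '-', words by spaces."""
    words = []
    for w in sentence.iter("w"):
        parts = [m.findtext(name, default="") for m in w.iter("m")]
        words.append("-".join(parts))
    return " ".join(words)


sentence = ET.fromstring(FRAGMENT)
for name in ("seg", "pos", "ilg"):
    print(tier(sentence, name))
```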

Multiple organizations with shared substructures

Standard form of ‘overlap problem’.

Some well understood approaches:

CONCUR
Trojan Horse markup
Standoff markup

Peter and Paul — two structures

CONCUR

Mark multiple structures, each labeled.

<!DOCTYPE div SYSTEM "tei/dtd/teispok2.dtd">
<!DOCTYPE text SYSTEM "tei/dtd/teiana2.dtd">
<(div)div type="dialog" org="uniform">
  <(text)text>
     <(div)u who="Peter">
       <(text)s>Hey Paul!</(text)s>
       <(text)s>Would you give me
     </(div)u>
     <(div)u who="Paul">
       the hammer?</(text)s>
     </(div)u>
  </(text)text>
</(div)div>
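
A rough sketch of reading the markup above one tree at a time: keep the tags labeled for one document type, drop the others, and hand the result to an ordinary XML parser. This is not an SGML parser. It assumes every tag carries a `(doctype)` label and no attribute value contains `>`; the function name `project` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The CONCUR example from above, without the DOCTYPE declarations.
CONCUR = """<(div)div type="dialog" org="uniform">
  <(text)text>
     <(div)u who="Peter">
       <(text)s>Hey Paul!</(text)s>
       <(text)s>Would you give me
     </(div)u>
     <(div)u who="Paul">
       the hammer?</(text)s>
     </(div)u>
  </(text)text>
</(div)div>"""

# Matches <(name)rest> and </(name)rest>.
TAG = re.compile(r"<(/?)\((\w+)\)([^>]*)>")


def project(marked_up, doctype):
    """Keep only the tags labeled with `doctype`, stripping the label."""
    def keep_or_drop(match):
        slash, name, rest = match.groups()
        return f"<{slash}{rest}>" if name == doctype else ""
    return TAG.sub(keep_or_drop, marked_up)


utterances = ET.fromstring(project(CONCUR, "div"))
sentences = ET.fromstring(project(CONCUR, "text"))
print([u.get("who") for u in utterances.iter("u")])
print([" ".join(s.text.split()) for s in sentences.iter("s")])
```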

CONCUR (2)

Plus:

easy to understand
easy to process one tree at a time
easy to validate (DTDs, rabbit/duck grammars)

Minus:

not widely implemented (but easy to implement in XML)
desired editor behavior unclear
what data structure to use?

Trojan Horse markup

When an element doesn't fit, tag its start and end with empty elements (here the second <s>):

<div type="dialog" org="uniform">
  <u who="Peter">
    <s>Hey Paul!</s>
    <s sID="s2"/>Would you give me
  </u>
  <u who="Paul">
    the hammer?<s eID="s2"/>
  </u>
</div>
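
A minimal sketch of recovering the content of the 'virtual' second sentence: walk the tree in document order and collect the text between the `sID` marker and the matching `eID` marker. The function name `virtual_text` is hypothetical; the attribute names are those used above.

```python
import xml.etree.ElementTree as ET

TROJAN = """<div type="dialog" org="uniform">
  <u who="Peter">
    <s>Hey Paul!</s>
    <s sID="s2"/>Would you give me
  </u>
  <u who="Paul">
    the hammer?<s eID="s2"/>
  </u>
</div>"""


def virtual_text(root, tag, marker_id):
    """Collect the text between <tag sID=marker_id/> and <tag eID=marker_id/>."""
    collected = []
    inside = False

    def walk(elem):
        nonlocal inside
        if elem.tag == tag and elem.get("sID") == marker_id:
            inside = True            # start marker: begin collecting
        elif elem.tag == tag and elem.get("eID") == marker_id:
            inside = False           # end marker: stop collecting
        else:
            if inside and elem.text:
                collected.append(elem.text)
            for child in elem:
                walk(child)
                # Text after a child belongs to the parent (its "tail").
                if inside and child.tail:
                    collected.append(child.tail)

    walk(root)
    # Normalize the whitespace that came from pretty-printing.
    return " ".join("".join(collected).split())


sentence = virtual_text(ET.fromstring(TROJAN), "s", "s2")
print(sentence)
```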

Standoff markup

Point into the data from elsewhere.

<documentset>
  <div type="dialog" org="uniform">    
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>      
      <s xml:id="s1">
	<mirror target="#t1"/>
      </s>
      <s xml:id="s2">
	<mirror target="#t2 #t3"/>
      </s>
    </text>
  </div>
</documentset>

Standoff markup (bis)

The TEI <join> element is designed specifically for this.

<documentset>
  <div type="dialog" org="uniform">    
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>
      <join xml:id="s1" result="s"
	    target="#t1"/>
      <join xml:id="s2" result="s"
	    target="#t2 #t3"/>
    </text>
  </div>
</documentset>
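
A minimal sketch of resolving the stand-off pointers: index the segments by `xml:id`, then follow each `<join>`'s `target` list to assemble the sentence. It assumes the targets are bare fragment identifiers within the same document, as in the example; the function name `resolve_joins` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:id under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOCUMENTSET = """<documentset>
  <div type="dialog" org="uniform">
    <u who="Peter">
      <seg xml:id="t1">Hey Paul!</seg>
      <seg xml:id="t2">Would you give me</seg>
    </u>
    <u who="Paul">
      <seg xml:id="t3">the hammer?</seg>
    </u>
  </div>
  <div type="standoff s">
    <text>
      <join xml:id="s1" result="s" target="#t1"/>
      <join xml:id="s2" result="s" target="#t2 #t3"/>
    </text>
  </div>
</documentset>"""


def resolve_joins(root):
    """Map each <join>'s xml:id to the concatenated text of its targets."""
    by_id = {e.get(XML_ID): e for e in root.iter() if e.get(XML_ID)}
    sentences = {}
    for join in root.iter("join"):
        parts = [by_id[ref.lstrip("#")].text for ref in join.get("target").split()]
        sentences[join.get(XML_ID)] = " ".join(parts)
    return sentences


joined = resolve_joins(ET.fromstring(DOCUMENTSET))
print(joined)
```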

Single-source (view materialization)

Define a ‘master view’ with all information.

(Often complex!)

Generate task-specific views by filtering information out

at need (dynamically)
in advance (statically)
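
A sketch of static view materialization by filtering, using the dictionary entry from earlier as the master view: dropping the editorial additions yields a view close to the straight encoding of the printed page. Which elements and attributes count as editorial is an assumption made for the illustration, as is the function name `materialize_document_view`.

```python
import copy
import xml.etree.ElementTree as ET

# Abridged master view: the printed entry plus editorial additions.
MASTER = """<entry>
  <form>
    <orth>oathbrea&#x2032;king</orth>
    <pron value="&#x02C8;&#x0259;&#x028A;&#x03B8;&#x02CC;bre&#x026A;k&#x026A;&#x014B;"/>
  </form>
  <gramGrp>
    <pos norm="noun">n. s.</pos>
  </gramGrp>
  <sense>
    <cit type="example">
      <bibl><author key="Shakespeare">Shakesp.</author></bibl>
    </cit>
  </sense>
</entry>"""

# Assumed for this sketch: what was added by the editor, not by Johnson.
EDITORIAL_ELEMENTS = {"pron"}
EDITORIAL_ATTRIBUTES = {"norm", "key"}


def materialize_document_view(master):
    """Return a filtered copy; the master view itself is left untouched."""
    view = copy.deepcopy(master)
    for parent in list(view.iter()):
        for child in list(parent):
            if child.tag in EDITORIAL_ELEMENTS:
                # Note: this also drops any text following the child.
                parent.remove(child)
        for name in EDITORIAL_ATTRIBUTES:
            parent.attrib.pop(name, None)
    return view


master = ET.fromstring(MASTER)
view = materialize_document_view(master)
print(ET.tostring(view, encoding="unicode"))
```

The filter runs in one direction only, which is why edits made in a derived view are the hard case.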

Challenge of view materialization

If the task-specific view is read-only, fine.

If it generates / edits information ...

... you have to get the new info into the master view.

Dominant/recessive transformations

Define two Trojan-Horse representations:

View A dominant (with B as milestones)
View B dominant (with A as milestones)

Transform back and forth.

Dominant/recessive transformations (example)

Utterance dominant:

<div xml:id="d1" type="dialog" org="uniform">
  <text sID="t1"/>
  <u xml:id="u1" who="Peter">
    <s sID="s1"/>Hey Paul!<s eID="s1"/>
    <s sID="s2"/>Would you give me
  </u>
  <u xml:id="u2" who="Paul">
    the hammer?<s eID="s2"/>
  </u>
  <text eID="t1"/>
</div>

This is essentially CONCUR using Trojan-Horse markup for non-primary element structures.

Sentence dominant

Turning the tables:

<text xml:id="t1">
  <div sID="d1" type="dialog" org="uniform"/>
  <u sID="u1" who="Peter"/>
  <s xml:id="s1">Hey Paul!</s>
  <s xml:id="s2">Would you give me
  <u eID="u1"/>
  <u sID="u2" who="Paul"/>
    the hammer?</s>
  <u eID="u2"/>
  <div eID="d1"/>
</text>
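
A simplified sketch of transforming back and forth: flatten the document into start, end, and text events, treating real elements and `sID`/`eID` milestones alike, then write the events out again with a different set of element types made dominant. To stay short it keeps the outer `<div>` as a fixed root and swaps only `<u>` and `<s>`, without attempting the `<div>`/`<text>` swap shown above. It also assumes every real element carries an `xml:id`. The function names `events` and `flip` are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

UTTERANCE_DOMINANT = """<div xml:id="d1" type="dialog" org="uniform">
  <u xml:id="u1" who="Peter">
    <s sID="s1"/>Hey Paul!<s eID="s1"/>
    <s sID="s2"/>Would you give me
  </u>
  <u xml:id="u2" who="Paul">
    the hammer?<s eID="s2"/>
  </u>
</div>"""


def events(elem):
    """Yield start/end/text events for real elements and milestones alike."""
    attrs = dict(elem.attrib)
    if "eID" in attrs:
        yield ("end", elem.tag, attrs["eID"], {})
    elif "sID" in attrs:
        yield ("start", elem.tag, attrs.pop("sID"), attrs)
    else:
        ident = attrs.pop(XML_ID)  # assumption: real elements have xml:id
        yield ("start", elem.tag, ident, attrs)
        if elem.text:
            yield ("text", elem.text)
        for child in elem:
            yield from events(child)
            if child.tail:
                yield ("text", child.tail)
        yield ("end", elem.tag, ident, {})


def flip(root, dominant):
    """Serialize with `dominant` element types as real elements and all
    other types as sID/eID milestones."""
    out = []
    for event in events(root):
        if event[0] == "text":
            out.append(escape(event[1]))
            continue
        kind, name, ident, attrs = event
        extra = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        if name in dominant:
            out.append(f'<{name} xml:id="{ident}"{extra}>' if kind == "start"
                       else f"</{name}>")
        elif kind == "start":
            out.append(f'<{name} sID="{ident}"{extra}/>')
        else:
            out.append(f'<{name} eID="{ident}"/>')
    return "".join(out)


sentence_dominant = flip(ET.fromstring(UTTERANCE_DOMINANT), {"div", "s"})
round_trip = flip(ET.fromstring(sentence_dominant), {"div", "u"})
print(sentence_dominant)
```

The result is well-formed only if the dominant element types nest properly among themselves, which is the case for each of the two views here.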

Challenges of dominant/recessive representations

Validation
Schema development

Some TEI constructs

milestone elements
extension: Trojan-horse markup
analysis and correspondence links
stand-off annotation
- feature structures
- other annotation
- <join> element