Questions about TEI

Sample answers and commentary for the assignment due 14 October 2012.

Questions are shown as follows:

For the TEI vocabulary, answer the recurrent document-design questions (below). Find the answers either by consulting the spec(s) or by trying things against the DTDs.

Expected answers are shown for each question, formatted like this paragraph.

For the most part, the answers given here are based on TEI P5, the current version of the TEI vocabulary. But it is possible that in some cases my memory of TEI P3 and TEI P4 has led me to say things that are true of those versions of TEI but not of TEI P5.

For some questions, additional comments that go beyond the expected answer are also shown, formatted like this paragraph.

The questions are these:

TEI provides generic markup for sections using the div element or, alternatively, the elements div1, div2, div3, div4, div5, div6, and div7, where the numeric suffix indicates the depth of nesting.

Section headings are uniformly marked head; headings are optional, as are other specialized elements for the beginnings of sections (byline, dateline, epigraph, etc.).
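
The two styles of section markup can be processed in the same way. The following is a minimal sketch, not a definitive implementation: the sample fragments, the helper names (local_name, outline), and the choice of Python's standard xml.etree.ElementTree module are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # namespace used by TEI P5

# Hypothetical sample: the same two-level structure, tagged both ways.
UNNUMBERED = f"""
<body xmlns="{TEI_NS}">
  <div type="chapter">
    <head>Chapter one</head>
    <div type="section">
      <head>First section</head>
      <p>Some text.</p>
    </div>
  </div>
</body>"""

NUMBERED = f"""
<body xmlns="{TEI_NS}">
  <div1 type="chapter">
    <head>Chapter one</head>
    <div2 type="section">
      <head>First section</head>
      <p>Some text.</p>
    </div2>
  </div1>
</body>"""

# div plus div1 through div7
DIV_NAMES = {"div"} | {f"div{n}" for n in range(1, 8)}


def local_name(elem):
    """Strip the namespace part from an element's tag."""
    return elem.tag.rsplit("}", 1)[-1]


def outline(elem, depth=0):
    """Return (depth, heading) pairs for every section below elem."""
    result = []
    for child in elem:
        if local_name(child) in DIV_NAMES:
            head = child.find(f"{{{TEI_NS}}}head")
            # head is optional, so a section may have no heading at all
            title = head.text if head is not None else None
            result.append((depth + 1, title))
            result.extend(outline(child, depth + 1))
    return result


unnumbered_outline = outline(ET.fromstring(UNNUMBERED))
numbered_outline = outline(ET.fromstring(NUMBERED))
```

In this sketch the depth is computed from the nesting itself, so the numeric suffix of div1 and div2 carries no information the processor needs; both fragments yield the same outline.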

Lists of all kinds are marked up using the list element; distinctions among bulleted (or unordered) lists, numbered (or ordered) lists, and glossary or definition lists are made using the type attribute. (In some vocabularies, including ISO 8879 Annex E and HTML, the same distinctions are made by using distinct element types.)

Paragraphs are tagged using p, notes (both footnotes and endnotes) using note.

Lists and notes can occur both within and between paragraphs (so they behave in some ways like paragraph-level elements and in some ways like phrase-level elements). Both list items and notes may contain p elements, but neither is required to: when the list item would contain a single p element, the tags for the p element may be omitted, and the item element serves as the container for the paragraph. (So also for single-paragraph notes.)

The type attribute on list is a paradigm case of a semi-closed list: the documentation lists the values ordered, bulleted, simple, and gloss, with the suggestion that software be prepared to handle those types properly. Other values can be provided, however, when needed.
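
One way software might treat a semi-closed list of values is sketched here. The function name list_style and the decision to fall back to "simple" for unrecognized values are assumptions of this sketch, not something the TEI documentation prescribes; the TEI namespace is omitted from the sample markup for brevity.

```python
import xml.etree.ElementTree as ET

# The four values documented for the type attribute, per the text above.
DOCUMENTED_TYPES = {"ordered", "bulleted", "simple", "gloss"}


def list_style(list_elem, fallback="simple"):
    """Pick a rendering style for a list element.

    Documented values are honoured. Any other value (or no value) is still
    legal, so it is mapped to a fallback rather than rejected. Using
    "simple" as the fallback is an assumption of this sketch.
    """
    declared = list_elem.get("type")
    return declared if declared in DOCUMENTED_TYPES else fallback


styles = [
    list_style(ET.fromstring(markup))
    for markup in (
        '<list type="ordered"><item>one</item></list>',
        '<list type="inventory"><item>one</item></list>',  # project-specific value
        '<list><item>one</item></list>',                   # no type at all
    )
]
```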

TEI does provide specialized elements for some kinds of list: listBibl for lists of bibliographic references; listWit for lists of text-critical witnesses (manuscripts and editions); and listOrg, listEvent, listPerson, listPlace, and listNym for lists of organizations, events, persons, places, and names, respectively.

In retrospect, this user of TEI believes that it was a mistake not to separate glossaries out from bulleted and numbered lists. The required content of glossaries is different, and existing TEI software is inconsistent in whether it uses the type attribute or the existence of label elements among the children of a list to trigger the special handling needed for conventional layout of glossary lists.

It was also a mistake to specify the core of the content model as (item+ | (label, item)+): when labels are provided, the label and item belong together and it would be more useful to tag them as a single unit. The failure to provide a grouping element for term and definition is particularly irritating when processing glossary lists; the fact that the error is shared by other vocabularies is no consolation.
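
The extra work caused by the missing grouping element can be seen in a small sketch. The sample glossary and the helper names (plain_text, gloss_entries) are hypothetical; the point is only that the processor has to reconstruct each term/definition pair by walking the siblings in order.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical glossary list using the (label, item)+ content model.
GLOSSARY = f"""
<list xmlns="{TEI_NS}" type="gloss">
  <label>div</label>
  <item>a generic text division</item>
  <label>p</label>
  <item>a paragraph</item>
</list>"""


def local_name(elem):
    """Strip the namespace part from an element's tag."""
    return elem.tag.rsplit("}", 1)[-1]


def plain_text(elem):
    """Concatenate all text inside elem and normalize whitespace."""
    return " ".join("".join(elem.itertext()).split())


def gloss_entries(list_elem):
    """Pair each label with the item that follows it.

    Because label and item are siblings rather than children of a common
    wrapper, the pairing depends entirely on document order.
    """
    entries = []
    pending = None
    for child in list_elem:
        name = local_name(child)
        if name == "label":
            pending = plain_text(child)
        elif name == "item":
            if pending is None:
                raise ValueError("item without a preceding label")
            entries.append((pending, plain_text(child)))
            pending = None
    return entries


entries = gloss_entries(ET.fromstring(GLOSSARY))
```

Had the vocabulary wrapped each pair in a single element, the loop and its bookkeeping variable would reduce to one query per entry.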

And while in a confessional vein, I'll say that experience in recent years has made me prefer a different approach to paragraphs inside list items and notes. The idea of making single-paragraph list items appear as <item>...</item> instead of as <item><p>...</p></item> seemed like a good idea at the time, but it is not significantly more convenient for authoring purposes (especially not with a good XML editor), and it is significantly less convenient for processing.
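
The processing cost can be illustrated with a short sketch. Every routine that wants "the paragraphs of an item" has to handle two shapes. The sample list and the function name item_paragraphs are hypothetical, and the sketch ignores items that mix loose text with p children.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical list: one item omits the p tags, the other uses them.
SAMPLE = f"""
<list xmlns="{TEI_NS}" type="bulleted">
  <item>A single paragraph, p tags omitted.</item>
  <item><p>First paragraph.</p><p>Second paragraph.</p></item>
</list>"""


def plain_text(elem):
    """Concatenate all text inside elem and normalize whitespace."""
    return " ".join("".join(elem.itertext()).split())


def item_paragraphs(item):
    """Return the paragraphs of an item as a list of strings.

    Two cases must be distinguished: explicit p children, or an item whose
    own content stands in for a single paragraph.
    """
    paragraphs = item.findall(f"{{{TEI_NS}}}p")
    if paragraphs:
        return [plain_text(p) for p in paragraphs]
    return [plain_text(item)]


root = ET.fromstring(SAMPLE)
all_paragraphs = [item_paragraphs(item) for item in root.findall(f"{{{TEI_NS}}}item")]
```

If p were always required, item_paragraphs would need only its first branch.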

The TEI has a profusion of specialized phrase-level elements intended to identify typographically distinct phrases and to specify, at the same time, the reason for their special typographic treatment. Section 3.3 of the TEI Guidelines describes the elements most obviously relevant to this question (foreign, emph, distinct, mentioned, term), but the rest of chapter 3 defines a number of other phrase-level elements available in the TEI's ‘core’ tag set. Other chapters define phrase-level elements suitable for specialized types of document or for specialized interests.

Quotations, whether set as display quotations or run in, are tagged using q, quote, and/or said.

The element said is provided specifically for tagging direct discourse, but direct discourse may also be tagged using q.

Some participants in the development of the TEI vocabulary leaned toward the view that direct discourse in a novel is a form of quotation, and they wished the same element to be used both for direct discourse (viewed as quotation of the characters in a narrative) and for quotation in expository prose (actual quotation, typically of other authors).

Other participants objected to this, on the grounds that the words attributed to characters in a novel were, in reality, composed by the novel's author, and so do not constitute quotation. It was important, they felt, to distinguish what was actually written by the author of the text from what was written by others, in part because quantitative studies of style or careful linguistic analysis might wish to exclude quoted material from the corpus used to characterize an author. (If an author tends toward relatively short sentences, a few quotations from Kant could completely distort the author's statistical profile.)

TEI P3 and P4 attempted to navigate this dispute by providing q for use both for ‘real’ quotation and for direct discourse, and quote for projects which wished to distinguish ‘real’ quotation. Unfortunately, it's not always possible to tell when a quotation is real. (Writers of fiction, in particular, may purport to quote other authors when they have in fact written the allegedly quoted material themselves.) So the documentation for quote allows it to be used for any purported quotation. This makes it very hard to understand the intended distinction between q and quote.

In TEI P5, the analysis has been changed in a way that some may regard as an improvement and others as a further muddying of the waters. A new said element is introduced specifically for direct discourse (both in fiction and in reportage). And the scope of q is broadened to include anything with quotation marks around it. This seems to make q serve the same purpose for material in quotation marks as hi serves for material in italics or boldface (i.e. it has a solely typographic meaning, with no information on the reason for the special typographic treatment).

The lg and l (lower-case el) elements for verse, and the sp (speech), speaker, and stage elements are part of the TEI core.

Additional elements for dramatic texts and verse texts are provided in chapters 6 and 7 of the Guidelines.

The app element, defined in chapter 12, Critical Apparatus, is designed to hold information about individual points of variation among the witnesses to a text. It contains elements for the variant readings, which can be identified as belonging to specific witnesses.
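
A sketch of how the readings in app elements might be used to reconstruct the text of one witness follows. The sample line, the witness sigla (#A, #B, #C), and the function name witness_text are invented for illustration; the lem and rdg elements and the wit attribute are the apparatus elements referred to above.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical line of verse with one point of variation among three witnesses.
SAMPLE = (
    f'<l xmlns="{TEI_NS}">The '
    '<app><lem wit="#A">quick</lem><rdg wit="#B #C">swift</rdg></app>'
    ' fox</l>'
)


def local_name(elem):
    """Strip the namespace part from an element's tag."""
    return elem.tag.rsplit("}", 1)[-1]


def witness_text(elem, siglum):
    """Return the text of elem as it reads in the witness named by siglum."""
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child) == "app":
            # Use the first reading attributed to this witness, if any.
            for reading in child:
                if siglum in reading.get("wit", "").split():
                    parts.append(witness_text(reading, siglum))
                    break
        else:
            parts.append(witness_text(child, siglum))
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(SAMPLE)
text_a = witness_text(line, "#A")
text_c = witness_text(line, "#C")
```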

The choice element in the core tag set can be used for simple cases of variation: regularized or normalized spelling vs old spelling, corrected vs uncorrected text, abbreviation vs expanded form. It is not intended or suitable for recording (say) the differences between the Folio and Quarto texts of a play by Shakespeare, or the textual variations among different editions of a poem by Yeats.
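
For the simple cases, a processor can pick one alternative from each choice element. The sketch below assumes the pairings orig/reg, sic/corr, and abbr/expan as children of choice; the sample sentence and the function name render are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical paragraph with one regularization and one correction.
SAMPLE = (
    f'<p xmlns="{TEI_NS}">A '
    '<choice><orig>vertuous</orig><reg>virtuous</reg></choice> man '
    '<choice><sic>waz</sic><corr>was</corr></choice> here.</p>'
)

ORIGINAL = {"orig", "sic", "abbr"}   # what the source actually reads
EDITED = {"reg", "corr", "expan"}    # the editor's alternative


def local_name(elem):
    """Strip the namespace part from an element's tag."""
    return elem.tag.rsplit("}", 1)[-1]


def render(elem, prefer_edited):
    """Flatten elem to text, taking one alternative from each choice."""
    wanted = EDITED if prefer_edited else ORIGINAL
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child) == "choice":
            for alternative in child:
                if local_name(alternative) in wanted:
                    parts.append(render(alternative, prefer_edited))
                    break
        else:
            parts.append(render(child, prefer_edited))
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE)
reading_text = render(paragraph, prefer_edited=True)
diplomatic_text = render(paragraph, prefer_edited=False)
```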

In addition to the generic note element, there are provisions for a wide variety of forms of annotation which differ in formality and weight.

There are predefined elements for some linguistic analysis (s, cl, phr, w, and m for sentences, clauses, phrases, words, and morphemes; they can carry attributes with specification of their type and linguistic function).

The interp element is provided to allow light-weight unstructured annotation; the fs (feature-structure) element and a whole family of elements for its content provide for tightly structured annotations.

It is not clear why the interp element is needed, given that note already exists and provides the same facility for unstructured annotation that can be linked to arbitrary locations in the text. Some of those involved in developing the TEI vocabulary wanted a separate element to distinguish the notes present in the exemplar of an electronic text from annotations added by others in the course of their work with an electronic text. This is not a particularly compelling argument, since note has type and resp attributes precisely in order to allow notes from different annotators to be distinguished. But in the end, including the separate element seemed likely to make the TEI vocabulary more palatable to some potential users. (And it could not be argued that the TEI never included multiple element types that mean the same or similar things. Perhaps it should not do so, but that ship had sailed very early.)

Feature-structure markup was developed by the TEI working group on linguistic analysis, as feature structures are a commonplace in linguistic analysis. The mechanism is general enough, however, that it can be used for essentially any kind of structured information. When carried to its logical extreme, feature structure markup is an alternative both to XML itself and to database management systems. Perhaps fortunately, it is seldom carried to that extreme.
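
The generality of the mechanism is easy to see when a feature structure is converted into an ordinary nested data structure. This is a minimal sketch: the sample features are invented, the function name fs_to_dict is hypothetical, and only the value elements symbol, binary, numeric, string, and nested fs are handled.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical feature structure describing a word form.
SAMPLE = f"""
<fs xmlns="{TEI_NS}" type="word">
  <f name="pos"><symbol value="noun"/></f>
  <f name="plural"><binary value="true"/></f>
  <f name="lemma"><string>goose</string></f>
  <f name="agreement">
    <fs>
      <f name="person"><symbol value="third"/></f>
    </fs>
  </f>
</fs>"""


def local_name(elem):
    """Strip the namespace part from an element's tag."""
    return elem.tag.rsplit("}", 1)[-1]


def fs_to_dict(fs):
    """Convert an fs element into a (possibly nested) dictionary."""
    result = {}
    for feature in fs.findall(f"{{{TEI_NS}}}f"):
        if len(feature) == 0:
            continue  # value supplied some other way; not handled in this sketch
        name = feature.get("name")
        value = feature[0]
        kind = local_name(value)
        if kind == "fs":
            result[name] = fs_to_dict(value)
        elif kind == "string":
            result[name] = value.text or ""
        elif kind == "binary":
            result[name] = value.get("value") in ("true", "1")
        else:  # symbol, numeric, and similar: keep the value attribute as text
            result[name] = value.get("value")
    return result


word = fs_to_dict(ET.fromstring(SAMPLE))
```

Nothing in the conversion is specific to linguistics; the same code would serve for any set of named, typed, possibly nested properties.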

An example of non-linguistic use of TEI feature structures is the CATMA system developed by Jan Christoph Meister, a narratologist at the University of Hamburg, for annotation of narrative texts.

Hyperlinks are represented in TEI using the ref and ptr elements. (The former has textual content, the latter does not; the expectation is that the formatting or rendering system will supply appropriate text such as the number or title of the section being pointed at, or the number of the figure if it is a figure being pointed at, etc.)

Both incoming and outgoing links may be asserted using the link element.

The ref and ptr elements resemble the a element of HTML (when it carries an href attribute): they assert the existence of a hyperlink, one end-point of which is located at the ref, ptr, or a element itself. That is, one end-point of the hyperlink whose existence is asserted is invariably located at the element asserting the link. The link element, by contrast, asserts the existence of a hyperlink whose end-points are all distinct from the link element itself. It may thus be used to assert hyperlinks whose endpoints are all located in other documents.
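
The difference in where the end-points lie can be made concrete with a sketch that lists the end-points of each hyperlink in a fragment. The sample markup, the example.org address, the "(self)" placeholder, and the function name hyperlinks are all invented for illustration. The sketch reads the pointers from a target attribute and, as a hedge, also accepts targets, on the assumption that some releases of the vocabulary spell the attribute on link that way.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment with one ref, one ptr, and one link.
SAMPLE = f"""
<body xmlns="{TEI_NS}">
  <p xml:id="p1">See <ref target="#sec2">the next section</ref>
     or <ptr target="#fig1"/>.</p>
  <link target="#p1 http://example.org/other.xml#n4"/>
</body>"""


def local_name(elem):
    """Strip the namespace part from an element's tag."""
    return elem.tag.rsplit("}", 1)[-1]


def hyperlinks(root):
    """List the end-points of every hyperlink asserted below root."""
    links = []
    for elem in root.iter():
        name = local_name(elem)
        # Assumption: pointers are in target (or targets), space-separated.
        pointers = (elem.get("target") or elem.get("targets") or "").split()
        if name in {"ref", "ptr"}:
            # One end-point is always the asserting element itself.
            links.append((name, ["(self)"] + pointers))
        elif name == "link":
            # All end-points lie elsewhere.
            links.append((name, pointers))
    return links


found = hyperlinks(ET.fromstring(SAMPLE))
```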

There was a time when hypertext theorists regarded the necessity for placing a link assertion at one of the link's end-points as a sign of a crude and primitive hypertext system. (This is one reason hypertext enthusiasts were not originally much impressed by the World Wide Web.)

In TEI P3 and P4, ref and ptr were restricted to internal links, while the separate elements xref and xptr were provided for external links. By the time TEI P5 was developed, experience with the World Wide Web had made such a distinction appear arbitrary and unhelpful, so the two pairs of elements were merged.

There are TEI elements for all the special items mentioned except weights (for which use measure) and URIs.

TEI provides an inline header (called teiHeader) with (a) a full bibliographic description of the electronic document itself, (b) optional descriptions of non-bibliographic properties of the electronic document itself and the work it instantiates, and (c) a change history to keep track of modifications to the document.
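
A small header and a sketch of how software might read it are given here. The header content is invented, and it shows only parts (a) and (c): the bibliographic description in fileDesc and the change history in revisionDesc. The element names are those of the TEI header; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Hypothetical minimal header: file description plus change history.
HEADER = f"""
<teiHeader xmlns="{TEI_NS}">
  <fileDesc>
    <titleStmt><title>A sample document</title></titleStmt>
    <publicationStmt><p>Unpublished sample.</p></publicationStmt>
    <sourceDesc><p>Born digital.</p></sourceDesc>
  </fileDesc>
  <revisionDesc>
    <change when="2012-10-14">Corrected typos.</change>
    <change when="2012-10-01">Created file.</change>
  </revisionDesc>
</teiHeader>"""

header = ET.fromstring(HEADER)

# (a) bibliographic description of the electronic document itself
title = header.findtext("tei:fileDesc/tei:titleStmt/tei:title", namespaces=NS)

# (c) change history, in the order recorded in the header
changes = [
    (change.get("when"), change.text)
    for change in header.findall("tei:revisionDesc/tei:change", NS)
]
```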