Igel

Igel: Comparing document grammars using XQuery. C. M. Sperberg-McQueen (Black Mesa Technologies LLC); Oliver Schonefeld, Marc Kupietz, Harald Lüngen, Andreas Witt (Institut für Deutsche Sprache)

Overview Context Current state Further work

Context Institut für Deutsche Sprache (IDS), Mannheim DeReKo I5 (IDS flavor of TEI P5) problems of document grammar comparison

IDS (1)

Institut für Deutsche Sprache (= Institute for the German Language). Founded 1964. Independent research institute. Member of the Leibniz-Gemeinschaft (= Leibniz Association). Mission: "Research and document the German language in its use at present and in modern history (= ~1950++)". Funded by the federal government and the state of Baden-Württemberg.

IDS (2)

Four research departments: Grammar; Lexis; Pragmatics; Zentrale Forschung (= central research), with Korpuslinguistik (= corpus linguistics) and Forschungsinfrastrukturen (= research infrastructure). Support units: administration, public relations, library, (small) computing center.

DeReKo

IDS hosts several large (and unique) collections of German language resources

Among them: DeReKo, the world's largest archive of corpora of contemporary written German. 6.1 billion word tokens. Contains fiction, scientific texts, newspaper articles, and a wide variety of other text types. Legal agreements with text donors (e.g. publishers) ... therefore not available for download, but searchable through a custom search engine.

DeReKo uses a number of formats for representing corpora, including a customized version of XCES (IDS-XCES). Recent efforts: converting DeReKo to TEI P5 ... of course customized; we call it I5. No immediate benefit for "outside" parties, but IDS hopes I5 will: help with internal workflows; ease building and maintenance of quality assurance tools; allow abandoning the older in-house annotation format; enable new project members to familiarize themselves more quickly and easily with the model.

IDS collaborations

IDS collaborates with several other research institutes in various projects. Among them: the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW), which also has a large collection of corpora in a variant of TEI P5 ... but it's not quite the same as ours.

Finding common ground in the TEI

How does our subset of TEI compare to their subset?

Elements What elements do we have that they don't? What elements do they have that we don't? If we both have the element, does it have compatible content models? Attributes? Parameter entities? How do the two document grammars compare to what is actually in the data?
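The element-level questions above reduce to set operations on element names. The following is a minimal sketch, assuming each grammar has already been summarized as a mapping from element name to content model. The two dictionaries are invented for illustration and are not taken from I5 or the BBAW schema, and the textual comparison of content models is only a first pass: two models can be written differently and still accept the same content.

```python
# Hypothetical, hand-written summaries of two document grammars:
# element name -> content model as written in the DTD.
grammar_a = {
    "biblFull": "((titleStmt, editionStmt?, extent?, publicationStmt), sourceDesc*)",
    "p": "(#PCDATA | hi | name)*",
    "titleStmt": "(title+)",
}
grammar_b = {
    "biblFull": "((titleStmt, editionStmt?, extent?, publicationStmt, "
                "seriesStmt?, notesStmt?), sourceDesc*)",
    "p": "(#PCDATA | hi |  name)*",
    "seriesStmt": "(title+)",
}


def normalize(model):
    """Drop all whitespace so that layout differences do not count."""
    return "".join(model.split())


def compare_elements(a, b):
    """Report which elements are unique to each grammar and which differ."""
    shared = sorted(set(a) & set(b))
    return {
        "only_a": sorted(set(a) - set(b)),
        "only_b": sorted(set(b) - set(a)),
        "shared": shared,
        # Textual difference only; not a test of language equivalence.
        "differing": [n for n in shared if normalize(a[n]) != normalize(b[n])],
    }


report = compare_elements(grammar_a, grammar_b)
for key, names in report.items():
    print(key, names)
```

The same pattern extends to attributes and parameter entities by keying the dictionaries on those names instead.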

Current state system overview demo loading DTDs component lists declaration display

Data flow

DTD → dpp (DTD pre-processor) → XML → XML database (BaseX)
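To give a feel for the first step of this pipeline, here is a rough sketch of turning element declarations into XML that a database could load. It is not dpp and does not reproduce its output format, which these slides do not describe; the `dtd` and `element` names are made up, and parameter entities, comments, attribute lists, and conditional sections are all ignored.

```python
import re
import xml.etree.ElementTree as ET

# Matches <!ELEMENT name content-model>; content models contain no '>'.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)>", re.DOTALL)


def dtd_to_xml(dtd_text):
    """Build a small XML tree with one <element> per element declaration."""
    root = ET.Element("dtd")  # hypothetical wrapper element
    for name, model in ELEMENT_DECL.findall(dtd_text):
        decl = ET.SubElement(root, "element", {"name": name})
        decl.text = " ".join(model.split())  # collapse line breaks and runs of spaces
    return root


sample = """
<!ELEMENT biblFull ((titleStmt, editionStmt?, extent?, publicationStmt),
                    sourceDesc*)>
<!ELEMENT extent (#PCDATA)>
"""

doc = dtd_to_xml(sample)
print(ET.tostring(doc, encoding="unicode"))
```

Once declarations are available as XML, queries over them can be written in XQuery against the database rather than against DTD syntax.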

System structure HTML client (COTS browser) HTTP server (Apache) XQuery server (BaseX) intermediary bop.php (BaseX-over-PHP)

Demo

Watch this one ...

Future work deeper queries content-model visualization weighted automata

Deeper queries

How do these grammars relate? What is their union (valid against either)? What is their intersection (valid against both)?

How do these grammars differ? A minus B (valid against A, invalid against B)? B minus A (valid against B, invalid against A)?

But wait, isn't that hopeless?

Example: biblFull

Grammar A: ( (titleStmt, editionStmt?, extent?, publicationStmt), sourceDesc*)

Grammar B: ( (titleStmt, editionStmt?, extent?, publicationStmt, seriesStmt?, notesStmt?), sourceDesc*)

A ∪ B: ≡ B

A ∩ B: ≡ A

A - B: ∅

B - A: ( (titleStmt, editionStmt?, extent?, publicationStmt, ((seriesStmt, notesStmt?) | (notesStmt))), sourceDesc*)
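The four results for biblFull can be sanity-checked by brute force. The sketch below encodes each child element as one letter (an encoding chosen here purely for illustration), rewrites the content models as ordinary regular expressions, and enumerates every child sequence up to a fixed length. This is a bounded check, not a proof, and it is not how Igel computes these sets.

```python
import re
from itertools import product

# One letter per child element, so a child sequence becomes a short string.
CODES = {"titleStmt": "t", "editionStmt": "e", "extent": "x",
         "publicationStmt": "p", "seriesStmt": "s", "notesStmt": "n",
         "sourceDesc": "d"}

A = re.compile(r"te?x?pd*")                # Grammar A
B = re.compile(r"te?x?ps?n?d*")            # Grammar B
B_MINUS_A = re.compile(r"te?x?p(sn?|n)d*")  # claimed expression for B - A


def language(pattern, max_len):
    """All strings over the alphabet, up to max_len, that the pattern accepts."""
    alphabet = "".join(CODES.values())
    words = set()
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if pattern.fullmatch(word):
                words.add(word)
    return words


MAX_LEN = 6
lang_a = language(A, MAX_LEN)
lang_b = language(B, MAX_LEN)

print("A union B == B:       ", (lang_a | lang_b) == lang_b)
print("A intersect B == A:   ", (lang_a & lang_b) == lang_a)
print("A - B is empty:       ", (lang_a - lang_b) == set())
print("B - A matches formula:", (lang_b - lang_a) == language(B_MINUS_A, MAX_LEN))
```

Exhaustive enumeration grows exponentially with the length bound, which is one reason to work on automata instead for larger content models.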

But wait, this one really IS hopeless!

Visualization

Content models are just regular expressions.

So they can be shown as finite state automata.
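As an illustration, here is a hand-built deterministic automaton for Grammar A's biblFull content model, together with a function that writes it out in GraphViz DOT syntax. The state numbering is arbitrary, and the automaton was constructed by hand for this sketch rather than derived by any tool.

```python
START = 0
ACCEPTING = {4}

# (state, child element) -> next state, for
# ((titleStmt, editionStmt?, extent?, publicationStmt), sourceDesc*)
TRANSITIONS = {
    (0, "titleStmt"): 1,
    (1, "editionStmt"): 2,
    (1, "extent"): 3,
    (1, "publicationStmt"): 4,
    (2, "extent"): 3,
    (2, "publicationStmt"): 4,
    (3, "publicationStmt"): 4,
    (4, "sourceDesc"): 4,
}


def accepts(children):
    """Return True if the sequence of child element names is valid."""
    state = START
    for child in children:
        state = TRANSITIONS.get((state, child))
        if state is None:  # no transition defined: the sequence is invalid
            return False
    return state in ACCEPTING


def to_dot():
    """Render the automaton as GraphViz DOT text."""
    states = {START} | ACCEPTING | set(TRANSITIONS.values())
    states |= {src for src, _ in TRANSITIONS}
    lines = ["digraph biblFull {", "  rankdir=LR;"]
    for state in sorted(states):
        shape = "doublecircle" if state in ACCEPTING else "circle"
        lines.append(f"  {state} [shape={shape}];")
    for (src, label), dst in sorted(TRANSITIONS.items()):
        lines.append(f'  {src} -> {dst} [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


print(accepts(["titleStmt", "extent", "publicationStmt", "sourceDesc"]))
print(to_dot())
```

The DOT text can be saved to a file and rendered with any GraphViz layout program.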

E.g.

The other grammar

E.g.

Another view

E.g.

A complication

This doesn't work well for all elements. Here is p (paragraph):

A complication (2)

Fortunately, GraphViz has multiple layout algorithms. Here is another:

Handling mixed content

Mixed content (in XML-DTD style) requires special handling:

Minimized automata

Minimized FSA are smaller:
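One standard way to minimize a deterministic automaton is partition refinement (Moore's algorithm): start with accepting and non-accepting states in separate classes, then keep splitting classes whose members disagree on where a symbol leads. The sketch below assumes every state is reachable and treats a missing transition as a move to an implicit dead state. The example automaton, for a made-up model `((seriesStmt, sourceDesc) | (notesStmt, sourceDesc))`, is for illustration only, and nothing here is claimed to be the algorithm Igel uses.

```python
def minimize(states, alphabet, transitions, accepting):
    """Map each state to the id of its equivalence class (Moore's algorithm)."""
    block = {s: (1 if s in accepting else 0) for s in states}
    while True:
        # A state's signature: its own class plus the class reached on each
        # symbol (None stands for "no transition", i.e. the dead state).
        signature = {
            s: (block[s],) + tuple(
                block.get(transitions.get((s, sym))) for sym in alphabet)
            for s in states
        }
        distinct = sorted(set(signature.values()), key=repr)
        ids = {sig: i for i, sig in enumerate(distinct)}
        new_block = {s: ids[signature[s]] for s in states}
        # Classes only ever split, so an unchanged count means a stable partition.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block


states = [0, 1, 2, 3]
alphabet = ["seriesStmt", "notesStmt", "sourceDesc"]
transitions = {
    (0, "seriesStmt"): 1,
    (0, "notesStmt"): 2,
    (1, "sourceDesc"): 3,
    (2, "sourceDesc"): 3,
}
classes = minimize(states, alphabet, transitions, accepting={3})
print(classes)
print("states before:", len(states), "after:", len(set(classes.values())))
```

States 1 and 2 behave identically from that point on, so they fall into the same class and the minimized automaton needs one state fewer.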

Weighted automata

But how much of the content model is cruft?

Let's color-code for frequency:
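Frequency weights of this kind can be gathered by replaying the child sequences found in real documents through the automaton and counting how often each transition fires. The sketch below does that for the Grammar A biblFull automaton; the three sample sequences are invented for illustration and are not DeReKo data.

```python
from collections import Counter

# Hand-built automaton for
# ((titleStmt, editionStmt?, extent?, publicationStmt), sourceDesc*)
TRANSITIONS = {
    (0, "titleStmt"): 1,
    (1, "editionStmt"): 2,
    (1, "extent"): 3,
    (1, "publicationStmt"): 4,
    (2, "extent"): 3,
    (2, "publicationStmt"): 4,
    (3, "publicationStmt"): 4,
    (4, "sourceDesc"): 4,
}


def transition_counts(sequences):
    """Count how often each transition is taken across all sequences."""
    counts = Counter()
    for children in sequences:
        state = 0
        for child in children:
            nxt = TRANSITIONS.get((state, child))
            if nxt is None:  # invalid sequence: stop counting for it
                break
            counts[(state, child)] += 1
            state = nxt
    return counts


# Invented sample: the children of three biblFull elements.
observed = [
    ["titleStmt", "publicationStmt"],
    ["titleStmt", "publicationStmt", "sourceDesc"],
    ["titleStmt", "extent", "publicationStmt"],
]

counts = transition_counts(observed)
unused = sorted(t for t in TRANSITIONS if counts[t] == 0)

for transition in sorted(TRANSITIONS):
    print(transition, counts[transition])
print("never used:", unused)
```

Transitions with a count of zero are candidates for the "cruft" in the content model, and the counts can drive edge colors or line widths when the automaton is drawn.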

What's actually in paragraphs?

Limitations, regrets

No db in the cloud; you have to install. Currently DTD syntax only; other schema languages non-zero priority. UI currently primitive; XForms will help.

Show me the code, ...

https://github.com/BlackMesaTechnologies/Igel.git

Acknowledgements Photo: Detail from Black Mesa, by Marcin Wichary, 9 September 2008 (CC BY 2.0)