Aparecium

An XQuery / XSLT library for invisible XML

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

5 November 2021

http://blackmesatech.com/2021/11/Aparecium/


Overview

What is it?

Invisible XML

Steven Pemberton (2013) at Balisage: “a method for treating non-XML documents as if they were XML”
Combines two ideas:
  • Documents have (information in general has) structure worth exposing to software.
  • Non-XML resources often have implicit structure.
Two audiences:
  • Authors can work in non-XML notations.
  • XSLT / XQuery programmars can read and process non-XML resources.

Example: Natural language

“Sarah laughed.”

Example (2): Natural language

“The cows on the hill ate grass.”

Example (2b): Natural language

“The cows on the hill ate grass.”
    <transitive-sentence>
      <np>
        <det>the</det>
        <n>cows</n>
        <pp>
          <prep>on</prep>
          <np>
            <det>the</det>
            <n>hill</n>
          </np>
        </pp>
      </np>
      <vp-t>
        <v-t>ate</v-t>
        <np>
          <n>grass</n>
        </np>
      </vp-t>
      <stop>.</stop>
    </transitive-sentence>    

Example (3): Artificial language

CSS as I write it:
td {
    vertical-align: top;
}
img {
    text-align: center;
}
CSS as I might want to process it:
<css>
    <rule selector="td">
        <property name="vertical-align" value="top"/>
    </rule>
    <rule selector="img">
        <property name="text-align" value="center"/>
    </rule>
</css>

Aparecium

Hermione was pulling her wand out of her bag.
‘It might be invisible ink!’ she whispered.
She tapped the diary three times and said, ‘Aparecium!

-J. K. Rowling, Harry Potter and the Chamber of Secrets

Show and tell

The easiest way to learn about invisible XML is by example.
Demo:
  • list of numbers
  • s-expressions
  • arithmetic expressions
Note:
  • grammar notation
  • hiding nonterminals, terminals
  • attributes or elements

Some notes

  • proof of concept, not industrial code
  • (he means: it's slow and not well debugged)
  • (I mean, like, really kind of slow ...)
  • (sooooooo slow ...)

How do I use it?

No restrictions

Invisible XML specifies notation but does not restrict the grammar.
  • It need not be LL(1).
  • It need not be LL(k).
  • It need not be LALR(1).
  • It need not be unambiguous.
Any context-free grammar is accepted.

Show and tell, again

Demo: look at the code
  • aparecium:parse-string()
  • aparecium:compile-grammar-from-string()
  • aparecium:parse-string-with-compiled-grammar()

How does it work?

Earley parsing

Consider parsing a string (“Sarah laughed”) against a grammar:
-sentence: transitive-sentence; intransitive-sentence.
transitive-sentence:  np, vp-t, stop.
intransitive-sentence:  np, vp-i, stop.

np: pn; det?, n, pp*; pron.
vp-t:  v-t, np, pp*.
vp-i:  v-i, pp*.
pp: prep, np.
    

Earley parsing (2)

Recursive-descent parsing requires that we know which we want, now, without looking further.
Earley parsing tries both:

Earley parsing (3)

Earley parsing (in a nutshell)

Extended BNF

Earley uses BNF; we use EBNF.
Position in a BNF rule is simple, can be just a number from 0 to length of rule.
Two approaches:
  • Compile EBNF rules to (recursive) BNF.
  • Extend notion of rule position.
Aparecium takes second approach.

Compiling the grammar (1)

Each rule in an iXML (EBNF) grammar is a regular expression:
  • symbol (non-terminal or terminal)
  • E?
  • E*
  • E+
Also repetitions with separators:
  • E*F (zero or more E, separated by F)
  • E+F (one or more E, separated by F)
  • (E)

Compiling the grammar (2)

Brüggemann-Klein 1993 shows (following Gluschkov) how to make an FSA from a regular expression (in quadratic time):
  • Let states be the primitive symbols of the RE.
  • For each sub-expression E, calculate (bottom up)
    • first(E) (initial states of E)
    • last(E) (final states of E)
    • for each state q in E, and each symbol s in alphabet, also
      follow(q, s, E) (transitions in q within E)

Compiling the grammar (3)

For example:
expression: term, (s, addop, s, term)*.
This produces

The compiled rule

The compiled rule
  • assigns an ID to each symbol (state)
  • decorates each expression with first, last, nullable, and follow:* attributes.
    Aparecium needs the ones on the outermost definition.
  <rule name="expression"
	xmlns:follow="http://example.com/followset">
    <def xml:id="exp_def_1"
	 nullable="false"
	 first="term_1"
	 last="term_1 term_2"
	 follow:term_1=" s_1"
	 follow:s_1=" addop_1"
	 follow:addop_1=" s_2"
	 follow:s_2=" term_2"
	 follow:term_2="s_1">
       ...
     </def>
  </rule>

Ambiguity

The iXML spec returns one tree; in case of ambiguity, signal that there are others.
Aparecium wants to return all trees (to help debug the grammar).

Ambiguity (2)

But ... some languages are infinitely ambiguous:
  • A: A; "a".
    Unbounded depth (‘vertically infinite’):
    • <A>a</A>
    • <A><A>a</A></A>
    • <A><A><A>a</A></A></A>
    • ...
  • S: X*.
    X: "x"; {nothing}.
    Unbounded width (‘horizontally infinite’). Parses for the empty string include:
    • <S><X/></S>
    • <S><X/><X/></S>
    • <S><X/><X/><X/></S>
    • ...

Ambiguity (3)

Handling infinite ambiguity:
  • Loop detection.
    • <A>a</A>
    • <S><X/></S>
  • Parse-forest grammars (grammatical representation of parse tree).
    • A: A_0_1.
      A_0_1: A_0_1; "a".
    • S: S_0_0.
      S_0_0: X_0_0*.
      X_0_0: {nothing}.

Ambiguity (4)

Demo:
  • PP attachment ambiguity: “the cows eat the grass on the hill”.
  • infinite tree sets
    • S: S; 'a'.
    • S: X*. X: 'x'; {}.

Representation of Earley items

  • As element:
    <item from="0" to="20" ri="term_2">
      <rule name="expression">
        ...
      </rule>
    </item>
  • As map:
      map {
        'from' : $From,
        'to' : $To,
        'rule' : $r,
        'ri' : $ri
      }

Performance

Earley parsing creates a lot of Earley items:
at least one per character.
What would you do?

What I did

Measure first. (D'oh!)
The major time cost was not constructing Earley items but finding them in the set.

What now?

Related work

  • Invisible XML:
    • XML::Invisible Perl package (“Ed J”)
    • Functional invisible XML parser (John Snelson)
    • Earley parsers in ABC, C (Steven Pemberton)
    • Jay Parser (Tom Hillman)
  • Parsing in XSLT, XQuery:
    • REx (Gunther Rademacher)
    • FXSLT LR-1 parser (Dimitre Novatchev)
  • Exposing non-XML resources in XML (‘XML lenses’):
    • DFDL Data Format Definition Language (Open Grid Forum)
    • Data Direct Connectors
    • Regular fragmentation (Simon St. Laurent 2001)
    • STnG (Ari Krupnikov 2003)

What next?

Thanks

Thank you.
Any questions?