Aparecium
An XQuery / XSLT library for invisible XML
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
5 November 2021
http://blackmesatech.com/2021/11/Aparecium/
 
Overview
- What is it?- invisible XML?
- Aparecium?
 
- How do I use it?
- How does it work? - Earley parsing
- Extended BNF
- Ambiguity
- Speed
 
- What now?
 
What is it?
- What is invisible XML?
- What is Aparecium?
 
    Invisible XML
    
      Steven Pemberton (2013) at Balisage:
      “a method for treating non-XML documents
	as if they were XML”
    
    Combines two ideas:
    
- Documents have (information in general has)
      structure worth exposing to software.
- Non-XML resources often have implicit
      structure.
Two audiences:
    
- Authors can work in non-XML notations.
- XSLT / XQuery programmars can read
      and process non-XML resources.
 
    Example: Natural language
    “Sarah laughed.”
    
   
    Example (2): Natural language
    “The cows on the hill ate grass.”
    
   
    Example (2b): Natural language
    “The cows on the hill ate grass.”
    
    <transitive-sentence>
      <np>
        <det>the</det>
        <n>cows</n>
        <pp>
          <prep>on</prep>
          <np>
            <det>the</det>
            <n>hill</n>
          </np>
        </pp>
      </np>
      <vp-t>
        <v-t>ate</v-t>
        <np>
          <n>grass</n>
        </np>
      </vp-t>
      <stop>.</stop>
    </transitive-sentence>     
    Example (3): Artificial language
    CSS as I write it:
td {
    vertical-align: top;
}
img {
    text-align: center;
}CSS as I might want to process it:
<css>
    <rule selector="td">
        <property name="vertical-align" value="top"/>
    </rule>
    <rule selector="img">
        <property name="text-align" value="center"/>
    </rule>
</css> 
    Aparecium
  
    
       Hermione was pulling her wand out of her bag.
      ‘It might be invisible ink!’ she whispered.
      She tapped the diary three times and said, ‘Aparecium!’
     
    -J. K. Rowling, Harry Potter and the Chamber of Secrets
  
  - proof of concept implementation
- for use in XSLT* and XQuery
- basic idea:
    - Like doc(), only for non-XML data.
- Like unparsed-data(), only it returns a document.
 
 
    Show and tell
    The easiest way to learn about invisible XML
    is by example.
    Demo:
    
- list of numbers
- s-expressions
- arithmetic expressions
    Note:
    
- grammar notation
- hiding nonterminals, terminals
- attributes or elements
 
   
    Some notes
    
    - proof of concept, not industrial code
- (he means:  it's slow and not well debugged)
- (I mean, like, really kind of slow ...)
- (sooooooo slow ...)
 
   
How do I use it?
- prerequisites:
    - XQuery engine
- input
- grammar
 
- Using Aparecium:
    - import the module
- read the grammar and the input
- parse the string
 
 
    No restrictions
    Invisible XML specifies notation but does not restrict the grammar.
    
- It need not be LL(1).
- It need not be LL(k).
- It need not be LALR(1).
- It need not be unambiguous.
Any context-free grammar is accepted.
   
    Show and tell, again
    Demo: look at the code
    
- aparecium:parse-string()
- aparecium:compile-grammar-from-string()
- aparecium:parse-string-with-compiled-grammar()
 
   
How does it work?
- Earley parsing
- Extended BNF
- Ambiguity
- Speed
 
    Earley parsing
    Consider parsing a string (“Sarah laughed”) against a
    grammar:
    
-sentence: transitive-sentence; intransitive-sentence.
transitive-sentence:  np, vp-t, stop.
intransitive-sentence:  np, vp-i, stop.
np: pn; det?, n, pp*; pron.
vp-t:  v-t, np, pp*.
vp-i:  v-i, pp*.
pp: prep, np.
    
     
    Earley parsing (2)
    - We need a sentence (starting at offset 0).
- So we need either a transitive-sentence
      or an intransitive-sentence.
Recursive-descent parsing requires that we know
    which we want, now, without looking further.
    Earley parsing tries both:
    - Look for a transitive-sentence starting at 0.
- Look for an intransitive-sentence starting at 0.
- So, look for an np starting at 0.
- Look for a pn starting at 0.
- Look for a det starting at 0.
- Look for an n starting at 0.
- Look for a pron starting at 0.
 
    Earley parsing (3)
    - Look for any of “Sam”, “Sarah”, ... starting at 0.
- Win:  we found “Sarah” starting at 0.
- ...
- So look for a vp starting at 6.
- Look for any of “the”, “some”, ... starting at
      0. (... nope)
- Look for any of “cows”, “plant”, “grass”,
      “people”, “telescope”, “hill” ... starting at
      0. (... nope)
- Look for any of “you”, “we”, ... starting at
      0. (... nope)
 
    Earley parsing (in a nutshell)
    - Try everything that could possibly succeed. - (Differs here from many parsing methods.) 
- Try nothing that is not known to be relevant. - (Differs from CYK and some other general methods here.) 
- Keep track of what you try, using Earley
      items:
      (from-position in input,
      to-position in input,
      rule,
      position in rule). - When you generate the item
      (0, length, Goal,
      final),
      you're done.  Construct a tree from the set of items. 
 
    Extended BNF
    Earley uses BNF; we use EBNF.
    Position in a BNF rule is simple, can be just a number from 0 to
    length of rule.
    Two approaches:
    
- Compile EBNF rules to (recursive) BNF.
- Extend notion of rule position.
Aparecium takes second approach.
   
    Compiling the grammar (1)
    Each rule in an iXML (EBNF) grammar is a regular expression:
    
- symbol (non-terminal or terminal)
- E?
- E*
- E+
    Also repetitions with separators:
    
- E*F (zero or more
      E, separated by F)
- E+F (one or more
      E, separated by F)
- (E)
 
    Compiling the grammar (2)
    Brüggemann-Klein 1993 shows (following Gluschkov) how
    to make an FSA from a regular expression (in quadratic time):
    
- Let states be the primitive symbols of the RE.
- For each sub-expression E, calculate
      (bottom up)
      - first(E) (initial states of E)
- last(E) (final states of E)
- for each state q in E,
	and each symbol s in alphabet, 
	also
 follow(q, s,
	E) (transitions in q within E)
 
 
    Compiling the grammar (3)
    For example:
    
expression: term, (s, addop, s, term)*.
This produces
    

     
    The compiled rule
    The compiled rule
    
- assigns an ID to each symbol (state)
- decorates each expression with first,
      last, nullable, and
      follow:* attributes.
      
 Aparecium needs the ones on the outermost definition.
  <rule name="expression"
	xmlns:follow="http://example.com/followset">
    <def xml:id="exp_def_1"
	 nullable="false"
	 first="term_1"
	 last="term_1 term_2"
	 follow:term_1=" s_1"
	 follow:s_1=" addop_1"
	 follow:addop_1=" s_2"
	 follow:s_2=" term_2"
	 follow:term_2="s_1">
       ...
     </def>
  </rule>
     
    Ambiguity
    The iXML spec returns one tree; in case of
    ambiguity, signal that there are others.
    Aparecium wants to return all trees
    (to help debug the grammar).
   
    Ambiguity (2)
    But ... some languages are infinitely ambiguous:
 
    Ambiguity (3)
    Handling infinite ambiguity:
 
    Ambiguity (4)
    Demo:
- PP attachment ambiguity: “the cows eat the grass on the hill”.
- infinite tree sets- S: S; 'a'.
- S: X*. X: 'x'; {}.
 
 
   
    Representation of Earley items
    - As element:<item from="0" to="20" ri="term_2">
  <rule name="expression">
    ...
  </rule>
</item>
- As map:  map {
    'from' : $From,
    'to' : $To,
    'rule' : $r,
    'ri' : $ri
  }
 
    Performance
    Earley parsing creates a lot of
    Earley items:
    
    at least one per character.
    What would you do?
   
    What I did
    
    Measure first. (D'oh!)
    The major time cost was not constructing Earley items
    but finding them in the set.
   
What now?
- mention related work
- plans
 
    Related work
    
      - Invisible XML:
	- XML::Invisible Perl package (“Ed J”)
- Functional invisible XML parser (John Snelson)
- Earley parsers in ABC, C (Steven Pemberton)
- Jay Parser (Tom Hillman)
 
- Parsing in XSLT, XQuery:
	- REx (Gunther Rademacher)
- FXSLT LR-1 parser (Dimitre Novatchev)
 
- Exposing non-XML resources in XML (‘XML lenses’):
	- DFDL Data Format Definition Language (Open Grid Forum)
- Data Direct Connectors
- Regular fragmentation (Simon St. Laurent 2001)
- STnG (Ari Krupnikov 2003)
 
 
   
    What next?
  - Move from proof of concept to practical tool
    - Release engineering
- Speed improvements
- Better error reporting / recovery
 
- Easier use
    - Grammar repository for standard notations (CSS, JSON, XPath, XPointer, ...)
- Automatic selection of grammar based on MIME type?
 
- Grammar analysis tools
    - Is the grammar ambiguous?
- Is the grammar LL(1)? LALR(1)? ...
- Is a recursive descent parser possible?  Would it be faster?
- ...
 
- Sample applications
 
Thanks
Thank you.
Any questions?