URI Desk Calculator

This page allows the user to test the effect of various string functions used in URI creation, processing, or interpretation. Its initial intention is to help debug the atomic pieces of what will become more elaborate processing routines, but it may also be helpful in exploring the consequences of various design decisions in URI-related specifications.

For simple calculations involving numbers, it's convenient to have a calculator on your desk, preferably one with a paper tape to record the input data, the operations, and the results. This page provides something roughly analogous, for operations involving URIs. There is a buffer containing a string, on which various operations can be performed, including analysing the string (if it's a legal URI reference) into its component parts. And at the bottom, there is a log of the data, the operations performed, and their results. It's not quite a paper tape, but it's something.

Notes

N.B. Not all of these operations have been implemented yet; be patient.

The string and hex buffers

Character buffer: This is a normal HTML text input field.
Hex buffer: This hex buffer contains the UTF-8 equivalent of the character buffer, with extra whitespace added for legibility. The equivalence is as calculated by Richard Ishida's Unicode Code Converter (v6).
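
The relation between the two buffers can be sketched as follows. This is only an illustration, not the code used on this page (which is based on Richard Ishida's converter): the helper names toUtf8Hex and fromUtf8Hex are invented here, and the sketch assumes an environment that provides the standard TextEncoder and TextDecoder objects.

```javascript
// Sketch: render a string as space-separated UTF-8 hex octets.
// toUtf8Hex and fromUtf8Hex are hypothetical names, not taken from the page.
function toUtf8Hex(text) {
  const octets = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(octets, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

// Inverse direction: parse whitespace-separated hex octets back into a string.
function fromUtf8Hex(hex) {
  const tokens = hex.trim().split(/\s+/).filter((t) => t.length > 0);
  const octets = Uint8Array.from(tokens, (t) => parseInt(t, 16));
  return new TextDecoder("utf-8").decode(octets);
}

console.log(toUtf8Hex("/\u00E9?")); // prints "2F C3 A9 3F"
```

The non-ASCII character U+00E9 occupies two octets (C3 A9) in the hex form, which is why the two buffers can differ in length.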

Populating the buffers

You can type data into the string and hex buffers directly, or generate test data in a variety of ways.

Select random sample URI

Populate the string buffer with a URI or URI reference randomly selected from examples gathered from the text of RFC 3986 (and eventually from other sources). Some of the examples are presented by the specs as bad examples, and others illustrate syntax problems, so the examples are not guaranteed to be valid. (If we were being really careful here, we wouldn't call them URI references at all.)

Generate pseudo-random strings

Populate the buffers with a string generated using a pseudo-random-number generator. Several generators are available:

Printable ASCII characters: Characters in the range U+0020 .. U+007F.
Octets: Octets in the range 0 .. 255.
UCS characters: Characters from the Universal Character Set defined by ISO 10646 and Unicode. The surrogate code points are excluded, because they are an encoding artifact and do not represent characters.
URI fragments: The string is constructed by concatenating randomly selected characters and strings matching the various terminal and non-terminal productions of the URI grammar of RFC 3986. This generates slightly more ‘interesting’ test cases than the other random-string generators.

The length of the strings produced by all of these generators is also randomly selected.
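
As a rough sketch of two of these generators (not the page's actual code), the following shows a printable-ASCII generator and a UCS generator that skips the surrogate code points. The function names, the maxLength parameter and its default, and the use of Math.random are all assumptions made for illustration.

```javascript
// Sketch of two of the generators described above. Function names and the
// maxLength default are assumptions made for illustration.
function randomInt(lo, hi) {
  // Uniform pseudo-random integer in [lo, hi], inclusive.
  return lo + Math.floor(Math.random() * (hi - lo + 1));
}

function randomAsciiString(maxLength = 40) {
  const length = randomInt(0, maxLength);
  let out = "";
  for (let i = 0; i < length; i++) {
    out += String.fromCodePoint(randomInt(0x20, 0x7f)); // range as stated in the list above
  }
  return out;
}

function randomUcsString(maxLength = 40) {
  const length = randomInt(0, maxLength);
  let out = "";
  for (let i = 0; i < length; i++) {
    let cp;
    do {
      cp = randomInt(0x0, 0x10ffff);
    } while (cp >= 0xd800 && cp <= 0xdfff); // reject the surrogate code points
    out += String.fromCodePoint(cp);
  }
  return out;
}

console.log(JSON.stringify(randomAsciiString(10)));
```

Note that maxLength counts code points, so a string returned by randomUcsString may contain more UTF-16 code units than that.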

The “find ‘interesting’ string” operations use the random string generators just listed to create strings, which are then tested for certain properties.

The URI components

To illustrate the process of analysing URIs and URI references into their component parts, separate text widgets are provided for several of the important components of the URI:

scheme: The URI scheme (http, ftp, file, telnet, ...).
authority: The hierarchically organized naming authority included by some URI schemes; in most URIs encountered in day to day use, this is just the host name, but it may also include user information and a port number.
path: The data (usually hierarchically organized) which serves (along with the query) to identify the resource in question, within the scope of the URI scheme and the naming authority. Within the path, slash (“/”) is used to separate segments.
query: Non-hierarchically organized data which serves (together with the path) to identify the resource, within the scope of the scheme and the naming authority.
fragment: Additional information used to identify a secondary resource by reference to a primary resource. (The primary resource may be taken to be the one identified by the URI minus the fragment identifier, but RFC 3986 does not seem to say this explicitly.)

Operations on the string buffers and URI components

Analyse URI (reference) into components

Several operations analyse the string as a URI reference (or as a URI) and identify its various component parts. Different operations are provided for different grammars or sets of rules; comparing the behavior of different sets of rules for the same input is one of the major purposes of this page.

RFC 3986 Appendix A grammar: Parse the string as a URI reference using the grammar of RFC 3986 Appendix A (as reduced to a regular expression by Dan Connolly). If the string does not match the grammar for URI references, all of the component fields will be reduced to empty strings. A second button uses this grammar to analyse the string as a URI, not a URI reference. N.B. All valid URIs are valid URI references, but not vice versa.
RFC 3986 Appendix B regex: Parse the string as a URI reference using the (non-validating) regular expression of RFC 3986 Appendix B (translated into JavaScript). Populate the component fields accordingly.
HTML 5 rules: Parse the string as a URI reference using the grammar of RFC 3986 with modifications as specified in the current editor's draft of HTML 5.
WAH5: Analyse the string as a URI reference using the rules specified in the draft specification "Web addresses in HTML 5" (draft of 17 March 2009).
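
For the Appendix B case, the analysis can be sketched in a few lines. The regular expression is the one given in RFC 3986 Appendix B, with the slashes escaped for a JavaScript literal; the function name splitUriReference and the shape of the returned object are assumptions for illustration, not the code behind the button.

```javascript
// The regular expression from RFC 3986 Appendix B as a JavaScript literal.
// It is non-validating: every string matches, so exec() never returns null.
const APPENDIX_B = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// splitUriReference is a hypothetical name used only in this sketch.
function splitUriReference(text) {
  const m = APPENDIX_B.exec(text);
  return {
    scheme: m[2],    // undefined when the component is absent
    authority: m[4],
    path: m[5],      // always defined, possibly empty
    query: m[7],
    fragment: m[9],
  };
}

console.log(splitUriReference("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
```

The group numbers (2, 4, 5, 7, 9) are the ones RFC 3986 assigns to scheme, authority, path, query, and fragment. An absent component comes back as undefined, while a present but empty one (as in “http://example.org/?”) comes back as the empty string.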

Finding ‘interesting’ strings

The “Find ‘interesting’ string” button uses one or more random-string generators to try to find a string on which different sets of rules provide different results. Random strings are generated until an ‘interesting’ string is found, or until a maximum number of attempts has been made (currently 1000 per click). An ‘interesting’ string, for purposes of this page, is one which is accepted by some sets of rules and rejected by others, or for which different sets of rules provide different analyses of the URI reference into scheme, authority, path, query, and fragment.

Rules: Select two or more sets of rules you want to compare.
Generators: Select one or more random generators to use. (To concentrate on differences other than ASCII vs. non-ASCII characters, use the printable-ASCII generator only.)
Differences: If “any difference” is checked, any difference in any property of the analysis will make the string interesting. If “validity only” is checked, only a difference in whether the string is recognized at all will make the string interesting. (This will become more important when IRI-handling is added to the page.)
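
The search just described might be sketched as below. Everything here is hypothetical: the function and parameter names are invented, and the sketch assumes each analyser is a function that returns null for a rejected string, or otherwise an object with scheme, authority, path, query, and fragment fields.

```javascript
// Sketch of the search loop; all names are hypothetical.
const COMPONENTS = ["scheme", "authority", "path", "query", "fragment"];

function differs(a, b, validityOnly) {
  if ((a === null) !== (b === null)) return true; // accepted by one, rejected by the other
  if (validityOnly || a === null) return false;
  return COMPONENTS.some((name) => a[name] !== b[name]);
}

function findInterestingString(generators, analysers, validityOnly = false, maxAttempts = 1000) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Pick one of the selected generators at random and produce a candidate.
    const generate = generators[Math.floor(Math.random() * generators.length)];
    const candidate = generate();
    const results = analysers.map((analyse) => analyse(candidate));
    // Comparing everything against the first result is enough: if any two
    // results differ, at least one of them differs from the first.
    if (results.slice(1).some((r) => differs(results[0], r, validityOnly))) {
      return { candidate, attempt, results };
    }
  }
  return null; // nothing interesting found within the attempt budget
}
```

Passing validityOnly as true corresponds to the “validity only” setting. Leaving it false corresponds to “any difference”.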

Handling internationalized resource identifiers (IRIs)

RSN (real soon now): IRI handling has not yet been added to the page.

Whitespace handling

Normalize-space: Normalize whitespace in the string following the definition of the normalize-space() function in XPath 1.0.
Strip leading and trailing whitespace: Remove whitespace characters from the beginning and end of the string. (“Whitespace” here means the space, tab, carriage return, linefeed, form-feed, and vertical tab characters; other characters like U+2000 “en quad” or U+2009 “thin space” are not affected.)
Strip leading whitespace: Remove whitespace characters from the beginning of the string.
Strip trailing whitespace: Remove whitespace characters from the end of the string.
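
A minimal sketch of these four operations follows. It assumes the strip operations use exactly the six whitespace characters named above, and that normalize-space uses the narrower XPath 1.0 set (space, tab, carriage return, linefeed). The function names are invented for illustration.

```javascript
// normalize-space() as defined by XPath 1.0: collapse each run of whitespace
// (space, tab, carriage return, linefeed) to one space, then trim the ends.
function normalizeSpace(text) {
  return text.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");
}

// The strip operations use the wider set: space, tab, carriage return,
// linefeed, form feed, vertical tab. Other Unicode spaces are left alone.
function stripLeading(text) {
  return text.replace(/^[ \t\r\n\f\v]+/, "");
}

function stripTrailing(text) {
  return text.replace(/[ \t\r\n\f\v]+$/, "");
}

function stripBoth(text) {
  return stripTrailing(stripLeading(text));
}

console.log(JSON.stringify(normalizeSpace("  http://example.org/ \t\n a  ")));
```

The character classes are spelled out rather than written as \s because \s in JavaScript also matches characters such as U+2009, which these operations are meant to leave untouched.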

References, related work

This page focuses on the minutiae of URIs; anyone interested in such minutiae may well be interested in the related information listed here.

HTML 5

Ian Hickson, ed., "HTML 5: A vocabulary and associated APIs for HTML and XHTML", W3C Working Draft 12 February 2009. <http://dev.w3.org/html5/spec/Overview.html> Also <http://www.whatwg.org/specs/web-apps/current-work/multipage/> and <http://www.whatwg.org/specs/web-apps/current-work/>

A major revision of the HTML vocabulary, attempting to align the spec to the behavior of current browsers.

RFC 2234

D. Crocker, ed., and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997. <http://www.ietf.org/rfc/rfc2234.txt>

Defines the grammatical formalism used by RFC 3986 and RFC 3987 in specifying the rules for URIs and IRIs.

RFC 3490

P. Faltstrom, P. Hoffman, and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003. <http://www.ietf.org/rfc/rfc3490.txt>

An effort to make it possible for Internet host names to be written using characters outside the US ASCII character set.

RFC 3986

T. Berners-Lee, R. Fielding, and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", RFC 3986, January 2005. <http://www.ietf.org/rfc/rfc3986.txt>

The current version of the authoritative defining document for URIs.

RFC 3987

M. Duerst and M. Suignard, "Internationalized Resource Identifiers (IRIs)", RFC 3987, January 2005. <http://www.ietf.org/rfc/rfc3987.txt>

The current version of the authoritative defining document for IRIs (URIs for the whole world, even for the parts of the world which don't speak English all the time).

Unicode Converter

Richard Ishida, "Unicode Code Converter v6" <http://people.w3.org/rishida/scripts/uniview/conversion> and <http://rishida.net/scripts/uniview/conversion>

One of a set of immensely helpful Unicode-related utilities. The code on this page that handles UTF-8 and UTF-16 is based (with permission) on Richard Ishida's work.

URI Syntax tinkering

Dan Connolly, "URI Syntax tinkering", <http://homer.w3.org/~connolly/projects/urlp/raw-file/tip/tinker.html>

A tool for exploring the syntax of URIs and other constructs defined using ABNF. Reads ABNF from a text widget, generates a regular expression from the grammar, and uses the regular expression to analyse user-specified strings. Has buttons for loading the ABNF of RFC 3986 and the so-called "ABNF Core"; for other ABNF grammars you're on your own. (In particular, note that Dan's code doesn't handle absolutely every construct in ABNF, though it handles all the ones used by RFC 3986; also, there is no guarantee that the language defined by an ABNF can be reduced to a regular expression.)

The regular expressions used to analyse URIs on this page were generated by Dan Connolly's code.

XML Base

Jonathan Marsh and Richard Tobin, eds., "XML Base (Second Edition)", W3C Recommendation 28 January 2009. <http://www.w3.org/TR/xmlbase/>

Open problems and to-do list

The current form of this page is incomplete.

The analysers require that percent-escaped characters use upper-case hex, not lower-case. (Bug in the underlying regex generation code, owing to a perversity in ABNF.)
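
Until that is fixed, one possible workaround is to upper-case the hex digits of percent-escapes before handing a string to the analysers. This is only a sketch: the function name is invented, and the page itself does not do this. RFC 3986 section 6.2.2.1 treats upper- and lower-case hex digits in a percent-escape as equivalent and recommends upper-case, so the change does not alter what the string identifies.

```javascript
// Upper-case the two hex digits of every well-formed percent-escape.
// upperCasePercentEscapes is a hypothetical helper, not part of the page.
function upperCasePercentEscapes(text) {
  return text.replace(/%[0-9A-Fa-f]{2}/g, (escape) => escape.toUpperCase());
}

console.log(upperCasePercentEscapes("/a%2fb%c3%A9")); // prints "/a%2Fb%C3%A9"
```

Anything that is not a well-formed escape (for example “%zz”) is left as it is, so the analysers still get the chance to reject it.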

On the to-do list are:

Fully implement the functionality sketched out here.
Allow the components to be specified individually, in some local encoding, and provide the operations necessary to construct URIs from them (transcoding to a public character encoding, reduction to URI characters, escaping reserved characters, and inserting delimiters as needed). This would serve to illustrate RFC 3986 section 2.5.
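
The last of those steps, inserting delimiters, could look something like the sketch below, which follows the component recomposition procedure of RFC 3986 section 5.3. The function name recompose is an assumption, and the sketch further assumes that the components have already been transcoded and escaped, with absent components passed as undefined.

```javascript
// Recompose a URI reference from already-escaped components, following
// RFC 3986 section 5.3. An undefined component is treated as absent; an
// empty string is treated as present but empty. The name is hypothetical.
function recompose({ scheme, authority, path = "", query, fragment }) {
  let result = "";
  if (scheme !== undefined) result += scheme + ":";
  if (authority !== undefined) result += "//" + authority;
  result += path;
  if (query !== undefined) result += "?" + query;
  if (fragment !== undefined) result += "#" + fragment;
  return result;
}

console.log(recompose({ scheme: "http", authority: "example.org", path: "/a", query: "b=c" }));
// prints "http://example.org/a?b=c"
```

Keeping absent and empty components distinct matters here: an empty query still contributes its “?” delimiter, whereas an absent one contributes nothing.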

Possible follow-on work (if time allows):

Search automatically for strings which are accepted by Appendix B but not by Appendix A of RFC 3986.

19 March 2009 [under construction and subject to change]
