URI desk calculator
19 March 2009 [under construction and subject to change]
This page allows the user to test the effect of various
string functions used in URI creation, processing, or interpretation.
Its initial intention is to help debug the atomic pieces of
what will become more elaborate processing routines, but it
may also be helpful in exploring the consequences of various design
decisions in URI-related specifications.
For simple calculations involving numbers, it's convenient to have a
calculator on your desk, preferably one with a paper tape to record the
input data, the operations, and the results. This page provides
something roughly analogous, for operations involving URIs. There is
a buffer containing a string, on which various operations can be performed,
including analysing the string (if it's a legal URI reference) into
its component parts. And at the bottom, there is a log of the data,
the operations performed, and their results. It's not quite a paper
tape, but it's something.
Notes
N.B. At the moment, not all of these operations have been implemented; be patient.
The string and hex buffers
- Character buffer
- This is a normal HTML text input field.
- Hex buffer
- This hex buffer contains the UTF-8 equivalent of the character buffer, with
extra whitespace added for legibility. The equivalence is as calculated by
Richard Ishida's Unicode Code Converter (v6).
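The relation between the two buffers can be sketched as follows. This is a minimal illustration, not the code used on this page (which is based on Richard Ishida's converter); the function names toUtf8Hex and fromUtf8Hex are hypothetical, and the sketch assumes an environment that provides the standard TextEncoder and TextDecoder objects.

```javascript
// Hypothetical helper: render a string as space-separated UTF-8 octets in hex.
function toUtf8Hex(text) {
  const octets = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(octets, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

// Hypothetical inverse: parse whitespace-separated hex octets back into a string.
function fromUtf8Hex(hex) {
  const tokens = hex.trim().split(/\s+/).filter((t) => t.length > 0);
  const octets = Uint8Array.from(tokens, (t) => parseInt(t, 16));
  return new TextDecoder("utf-8").decode(octets);
}
```

For example, toUtf8Hex("a/b") yields "61 2F 62", and a non-ASCII character such as "é" occupies two octets, "C3 A9".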
Populating the buffers
You can type data into the string and hex buffers directly, or generate
test data in a variety of ways.
- Select random sample URI
- Populate the string buffer with a URI or URI reference randomly selected from examples
gathered from the text of RFC 3986 (and eventually from other sources).
Some examples may be intended by the specs as bad examples.
Others may be illustrating syntax problems; for this reason the examples are
not guaranteed valid. (If we were being really careful here,
we wouldn't call them URI references, for this reason.)
- Generate pseudo-random strings
- Populate the buffers with a string generated using a
pseudo-random-number generator. Several generators are available:
- Printable ASCII characters
- Characters in the range U+0020 .. U+007F.
- Octets
- Octets in the range 0 .. 255.
- UCS characters
- Characters from the Universal Character Set defined by ISO 10646 and
Unicode. The surrogate code points are excluded, because they
are an encoding artifact and do not represent characters.
- URI fragments
- The string is constructed by concatenating
randomly selected characters and strings matching the various
terminal and non-terminal productions of the URI grammar of RFC 3986.
This generates slightly more ‘interesting’ test cases
than the other random-string generators.
The length is randomly selected.
In every case, the length of the generated string is randomly selected.
The “find interesting string” operations use
the random string generators just listed to create strings which
are tested for certain properties.
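As a rough sketch of two of the generators described above (not the page's actual code), the following assumes Math.random as the pseudo-random source and a hypothetical maximum length of 40. The function names are invented for illustration.

```javascript
// Pick a random integer in [min, max], inclusive.
function randomInt(min, max) {
  return min + Math.floor(Math.random() * (max - min + 1));
}

// Printable ASCII, using the range stated in the list above (U+0020 .. U+007F).
function randomAsciiString(maxLength = 40) {
  const length = randomInt(1, maxLength);
  let s = "";
  for (let i = 0; i < length; i++) {
    s += String.fromCharCode(randomInt(0x20, 0x7f));
  }
  return s;
}

// UCS characters: any code point U+0000 .. U+10FFFF except the surrogates.
function randomUcsString(maxLength = 40) {
  const length = randomInt(1, maxLength);
  let s = "";
  for (let i = 0; i < length; i++) {
    let cp;
    do {
      cp = randomInt(0, 0x10ffff);
    } while (cp >= 0xd800 && cp <= 0xdfff); // re-draw if a surrogate comes up
    s += String.fromCodePoint(cp);
  }
  return s;
}
```

Note that the length here counts characters (code points), so a UCS string may occupy more UTF-16 code units than its nominal length.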
The URI components
To illustrate the process of
analysing URIs and URI references
into their component parts, separate text widgets are provided
for several of the important components of the URI:
- scheme
- The URI scheme (http, ftp, file, telnet, ...).
- authority
- The hierarchically organized naming authority included by
some URI schemes; in most URIs encountered in day to day
use, this is just the host name, but it may also include
user information and a port number.
- path
- The data (usually hierarchically organized) which serves (along
with the query) to identify the resource in question, within
the scope of the URI scheme and the naming authority.
Within the path, slash (“/”) is used to separate segments.
- query
- Non-hierarchically organized data which serves (together
with the path) to identify the resource, within the scope
of the scheme and the naming authority.
- fragment
- Additional information used to identify a secondary resource
by reference to a primary resource. (The primary resource
may be taken to be the one identified by the URI minus the fragment
identifier, but RFC 3986 does not seem to say this explicitly.)
Operations on the string buffers and URI components
Analyse URI (reference) into components
Several operations analyse the string as a URI reference (or as a URI) and
identify its various component parts. Different operations are provided
for different grammars or sets of rules; comparing the behavior of
different sets of rules for the same input is one of the major purposes
of this page.
- RFC 3986 Appendix A grammar
- Parse the string as a URI reference
using the grammar of RFC 3986 Appendix A (as
reduced to a regular expression by
Dan Connolly).
If the string does not match the grammar for URI references, all of the component fields will be reduced
to empty strings.
A second button uses this grammar to analyse the string as a URI, not a URI reference.
N.B. All valid URIs are valid URI references, but not vice versa.
- RFC 3986 Appendix B regex
- Parse the string as a URI reference using the (non-validating)
regular expression of RFC 3986 Appendix B (translated
into JavaScript). Populate the component fields accordingly.
- HTML 5 rules
- Parse the string as a URI reference using the grammar of RFC 3986 with
modifications as specified in
the current editor's draft of HTML 5.
- WAH5
- Analyse the string as a URI reference using the rules specified in the draft specification
"Web
addresses in HTML 5" (draft of 17 March 2009).
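To make the simplest of these analyses concrete, here is a sketch of the Appendix B approach in JavaScript. The regular expression is the one given in RFC 3986 Appendix B; the function name splitUriReference and the shape of the returned object are assumptions made for this illustration and need not match the code behind the buttons.

```javascript
// Regular expression from RFC 3986, Appendix B (slashes escaped for JavaScript).
const APPENDIX_B = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical wrapper: split a string into the five generic components.
// The expression is non-validating: every part is optional, so it matches
// any input string. Components that are absent come back as undefined.
function splitUriReference(text) {
  const m = APPENDIX_B.exec(text);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}
```

Applied to the example used in the RFC, http://www.ics.uci.edu/pub/ietf/uri/#Related, this gives scheme "http", authority "www.ics.uci.edu", path "/pub/ietf/uri/", no query, and fragment "Related".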
Finding ‘interesting’ strings
The “Find ‘interesting’ string” button
uses one or more random-string generators to try to find
a string on which different sets of rules provide
different results.
Random strings are generated until an ‘interesting’ string is found,
or until a maximum number of attempts is made (currently 1000 per
click). An ‘interesting’ string, for purposes of this
page, is one
which is accepted
by some sets of rules and rejected by others, or
for which different sets of rules provide different
analyses of the URI reference into scheme, authority,
path, query, and fragment.
- Rules
- Select two or more sets of rules you want to compare.
- Generators
- Select one or more random generators to use. (To concentrate
on differences other than ASCII vs. non-ASCII characters, use
the printable-ASCII generator only.)
- Differences
- If “any difference” is checked, any difference in any
property of the analysis will make the string interesting.
If “validity only” is checked, only a difference in whether the
string is recognized at all will make the string interesting.
(This will become more important when IRI-handling is added to the page.)
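The search described above might be sketched like this. Everything here is hypothetical: the function name, the options object, and the convention that an analyser returns an object with scheme, authority, path, query, and fragment properties, or null if it rejects the string. Only the limit of 1000 attempts and the two kinds of difference are taken from the description.

```javascript
// Hypothetical search loop. `generate` returns a candidate string; each
// analyser returns a components object, or null if it rejects the string.
function findInteresting(generate, analysers, { validityOnly = false, maxAttempts = 1000 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = generate();
    const results = analysers.map((analyse) => analyse(candidate));
    // Reduce each result to a comparable summary string.
    const summaries = results.map((r) => {
      if (r === null) return "rejected";
      if (validityOnly) return "accepted";
      return JSON.stringify([r.scheme, r.authority, r.path, r.query, r.fragment]);
    });
    // More than one distinct summary means the rule sets disagree.
    if (new Set(summaries).size > 1) return { candidate, attempt, results };
  }
  return null; // nothing interesting found within the attempt budget
}
```

With “validity only” semantics, two analysers that both accept a string never disagree, however differently they split it; with “any difference” semantics, a single differing component is enough.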
Handling internationalized resource identifiers (IRIs)
RSN (real soon now).
Whitespace handling
- Normalize-space
- Normalize whitespace in the string following the definition of the
normalize-space()
function in XPath 1.0.
- Strip leading and trailing whitespace
- Remove whitespace characters from the beginning and end of the string.
(“Whitespace” here means the space, tab, carriage return, linefeed, form-feed, and vertical tab
characters; other characters like U+2000 “en quad” or U+2009 “thin space” are
not affected.)
- Strip leading whitespace
- Remove whitespace characters from the beginning of the string.
- Strip trailing whitespace
- Remove whitespace characters from the end of the string.
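The whitespace operations can be sketched with a few regular expressions. This is an illustration under stated assumptions rather than the page's own code: the function names are hypothetical, normalizeSpace uses the four whitespace characters recognized by XPath 1.0 (space, tab, carriage return, linefeed), and the strip functions use the six characters listed above. The built-in String.prototype.trim is deliberately avoided, because it also removes Unicode space characters such as U+2009.

```javascript
// XPath 1.0 normalize-space(): strip leading and trailing whitespace, then
// collapse each internal run of whitespace to a single space.
function normalizeSpace(text) {
  return text.replace(/^[ \t\r\n]+|[ \t\r\n]+$/g, "").replace(/[ \t\r\n]+/g, " ");
}

// Strip operations: space, tab, CR, LF, form feed, and vertical tab only.
function stripLeading(text) {
  return text.replace(/^[ \t\r\n\f\v]+/, "");
}

function stripTrailing(text) {
  return text.replace(/[ \t\r\n\f\v]+$/, "");
}

function stripBoth(text) {
  return stripTrailing(stripLeading(text));
}
```

A string that begins with U+2009 “thin space” is left untouched by stripLeading, since that character is not in the list.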
References, related work
This page focuses on the minutiae of URIs; anyone interested in such
minutiae may well be interested in the related information listed here.
- HTML 5
- Ian Hickson, ed.,
"HTML 5:
A vocabulary and associated APIs for HTML and XHTML",
W3C Working Draft 12 February 2009.
<http://dev.w3.org/html5/spec/Overview.html>
Also
<http://www.whatwg.org/specs/web-apps/current-work/multipage/>
and
<http://www.whatwg.org/specs/web-apps/current-work/>
A major revision of the HTML vocabulary, attempting to
align the spec to the behavior of current browsers.
- RFC 2234
- D. Crocker, ed., and P. Overell,
"Augmented BNF for Syntax Specifications: ABNF",
RFC 2234, November 1997.
<http://www.ietf.org/rfc/rfc2234.txt>
Defines the grammatical formalism used by RFC 3986
and RFC 3987 in specifying the rules for URIs and IRIs.
- RFC 3490
- P. Faltstrom, P. Hoffman, and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)",
RFC 3490, March 2003.
<http://www.ietf.org/rfc/rfc3490.txt>
An effort to make it possible for Internet host names
to be written using characters outside the US ASCII character
set.
- RFC 3986
- T. Berners-Lee, R. Fielding, and L. Masinter,
"Uniform Resource Identifier (URI): Generic Syntax",
RFC 3986, January 2005.
<http://www.ietf.org/rfc/rfc3986.txt>
The current version of the authoritative defining document
for URIs.
- RFC 3987
- M. Duerst and M. Suignard,
"Internationalized Resource Identifiers (IRIs)",
RFC 3987, January 2005.
<http://www.ietf.org/rfc/rfc3987.txt>
The current version of the authoritative defining document
for IRIs (URIs for the whole world, even for the parts
of the world which don't speak English all the time).
- Unicode Converter
- Richard Ishida,
"Unicode Code Converter v6"
<http://people.w3.org/rishida/scripts/uniview/conversion>
and <http://rishida.net/scripts/uniview/conversion>
One of a set of immensely helpful Unicode-related
utilities. The code on this page that handles UTF-8
and UTF-16 is based (with permission) on Richard Ishida's work.
- URI Syntax tinkering
- Dan Connolly,
"URI Syntax tinkering",
<http://homer.w3.org/~connolly/projects/urlp/raw-file/tip/tinker.html>
A tool for exploring the syntax of URIs and other constructs
defined using ABNF. Reads ABNF from a text widget, generates
a regular expression from the grammar, and uses the regular
expression to analyse user-specified strings. Has buttons for
loading the ABNF of RFC 3986 and the so-called "ABNF Core"; for
other ABNF grammars you're on your own. (In particular, note
that Dan's code doesn't handle absolutely every construct in ABNF,
though it handles all the ones used by RFC 3986; also, there is
no guarantee that the language defined by an ABNF can be
reduced to a regular expression.)
The regular expressions used to analyse URIs on this page
were generated by Dan Connolly's code.
- XML Base
- Jonathan Marsh and Richard Tobin, eds.,
"XML Base (Second Edition)",
W3C Recommendation 28 January 2009.
<http://www.w3.org/TR/xmlbase/>
Open problems and to-do list
The current form of this page is incomplete.
The analysers require that percent-escaped characters use upper-case hex,
not lower-case. (Bug in the underlying regex generation code, owing to
a perversity in ABNF.)
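The case issue can be illustrated in isolation. In ABNF, quoted literals such as "A" match either case, so the HEXDIG rule also admits "a" through "f"; a generator that treats the literals as case-sensitive produces the stricter pattern. The two patterns and the helper below are a hypothetical sketch, not the generated expressions used on this page.

```javascript
// Literal (case-sensitive) reading of HEXDIG: upper-case hex digits only.
const PCT_STRICT = /^%[0-9A-F]{2}$/;

// ABNF reading: quoted literals are case-insensitive, so lower case matches too.
const PCT_ABNF = /^%[0-9A-Fa-f]{2}$/;

// Possible workaround: upper-case the hex digits of every percent-escape
// before analysis, leaving all other text alone.
function upcasePercentEscapes(text) {
  return text.replace(/%[0-9A-Fa-f]{2}/g, (escape) => escape.toUpperCase());
}
```

Here "%2f" is rejected by PCT_STRICT but accepted by PCT_ABNF, and upcasePercentEscapes("a%2fb") gives "a%2Fb".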
On the to-do list are:
- Fully implement the functionality sketched out here.
- Allow the components to be specified individually, in
some local encoding, and provide the operations necessary
to construct URIs from them (transcoding to a public
character encoding, reduction to URI characters,
escaping reserved characters, and inserting delimiters as needed).
This would serve to illustrate RFC 3986 section 2.5.
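A sketch of what the construction step might look like: the recompose function follows the component-recomposition pseudocode of RFC 3986 section 5.3, while escapeSegment is only a stand-in for the transcoding and escaping steps, using the built-in encodeURIComponent (which converts to UTF-8 and percent-escapes everything except letters, digits, and - _ . ! ~ * ' ( )). Both function names are hypothetical, and a real implementation would need per-component escaping rules.

```javascript
// Component recomposition, following the pseudocode in RFC 3986 section 5.3.
// A component that is undefined is omitted together with its delimiter.
function recompose({ scheme, authority, path = "", query, fragment }) {
  let result = "";
  if (scheme !== undefined) result += scheme + ":";
  if (authority !== undefined) result += "//" + authority;
  result += path;
  if (query !== undefined) result += "?" + query;
  if (fragment !== undefined) result += "#" + fragment;
  return result;
}

// Stand-in for transcoding plus escaping of a single path segment (assumption:
// UTF-8 is the public character encoding, and "/" inside a segment is data).
function escapeSegment(segment) {
  return encodeURIComponent(segment);
}
```

For example, recompose({ scheme: "http", authority: "example.org", path: "/" + escapeSegment("a b/é"), query: "x=1" }) gives "http://example.org/a%20b%2F%C3%A9?x=1".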
Possible follow-on work (if time allows):
- Search automatically for strings which are accepted by Appendix B but not by Appendix A of RFC 3986.
Copyright © 2009 World Wide Web Consortium,
(Massachusetts Institute of Technology,
European Research Consortium for Informatics and Mathematics,
Keio University).
All Rights Reserved.
Governed by the
W3C document license.
This page and the JavaScript code embedded here
include some material by others.
That material is, respectively,
copyright © 2006-2007 Ian Bicking;
copyright © 2000-2009 Richard Ishida;
copyright © 2009 Black Mesa Technologies LLC.
Used by permission of the copyright holders.