How many attributes?

[9 March 2011]

Beginning programmers are often taught to avoid arbitrary limits in their software, but in practice it’s not unusual for programmers to build capacity limits into their code: fixed-length fields are often easier to deal with than variable-length fields. So we have fixed-length IP numbers, fixed-length addresses in most machines, and who knows what all else.

So I wasn’t surprised today when I found a hard limit to the number of attributes allowed on an XML element in an XQuery implementation I use and think highly of.

But every now and then, the fixed limit turns out to be too small. That’s one reason beginners are instructed to treat such limits with suspicion.

When I first learned about IP numbers, for example, I remember being told there were enough IP numbers for everyone in the world to have one, including all the people in China. That seemed an extravagant plenitude of IP numbers, partly because we didn’t really expect everyone in China to want one. We didn’t expect everyone in the U.S. to want one, either: at the time, the only computer anyone in the room used for network activities was a shared mainframe, so we needed far fewer than one IP number per person; looking ahead to a time when people would want their microcomputers to be on the internet felt like being farsighted. We did not foresee (at least, I certainly didn’t) that individual machines without a human in attendance might also want IP numbers, so that the world would need more than one per human. So IPv6 became necessary. (If you don’t know what I’m talking about, don’t worry: the point is simply that limits which look reasonable at the outset sometimes turn out to be too low, especially if the system they are built into becomes highly popular.)

Memory addresses of 32 bits also seemed extravagant for a long time (that mainframe I worked on supported several hundred simultaneous users in a 24-bit address space; the actual hardware supported 31-bit addresses, but only specialized parts of the operating system used more than 24 bits). But nowadays more and more machines are moving to 64-bit addresses.

So I also wasn’t surprised that the reason I learned about the hard-coded limit in the XQuery engine was that it turned out to be set too low. The software declined to handle some XML I need to work with for a client’s project. It turns out that current versions of the Unicode Character Database are available in XML, which makes my life a lot easier. But most of the many character properties defined in the Unicode Database are represented as attributes, which blew well past the limit of (are you sitting down? ready?) 32.
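For the curious, the check is a one-liner. Here is a minimal XQuery sketch; the file name ucd.all.flat.xml and the char element follow the naming in the UCD’s XML distribution, and the path is just a placeholder for wherever your copy lives.

    (: How many attributes does the UCD XML actually put on an element?
       On the flat form of the database, the answer is well past 32. :)
    let $ucd := doc("ucd.all.flat.xml")
    return max(
      for $c in $ucd//*:char
      return count($c/@*)
    )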

I was a little surprised by the actual value: 32 seems like a very low number for such a limit. Who in their right mind (I found myself thinking) would expect a limit like that to suffice for real work? Who, indeed? But upon reflection, I remembered that I’ve been using this XQuery engine without incident for at least two or three years, doing a good deal of real work, and all the XML I’d dealt with in that time came in under the limit. So while 32 sounds low, the limit seems to have been fairly well chosen, at least for the work I’ve been doing.

The happy ending came when I wrote to the support list asking if there were some way to change the configuration to lift the limit. Less than ten minutes later, a reply was in my inbox (from the indefatigable Christian Grün, who must work very late hours to have been at his desk at that time of the night) saying that in current versions of the software (BaseX, for those who are curious, one of a number of excellent XQuery implementations available today), the limit has been raised to 2³¹, so now it can handle elements with a little over two billion attributes. (Hmm. Will that do? Well, let’s put it this way: if I wanted to experiment with a restructuring of the Unicode database that had one element per character property or property value, and a boolean attribute for each character indicating whether that character had that property [or that value for the property], the software could handle that many attributes. Actually, it could handle about a thousand times that many.)
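The arithmetic behind that parenthesis, for anyone who wants to check it (the code-point total below is simply the size of the Unicode code space, U+0000 through U+10FFFF, which I am taking as the relevant head count):

    (: Rough arithmetic for the thought experiment above. :)
    let $attribute-limit := 2147483648   (: 2^31 :)
    let $code-points     := 1114112      (: U+0000 through U+10FFFF :)
    return $attribute-limit idiv $code-points
    (: roughly 1,900: one boolean attribute per character fits,
       with headroom on the order of a thousand-fold :)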

Moral: it’s not necessarily an error when software has a fixed capacity limit. But as a user, you normally need to take care that the limits are appropriate to your needs.

Moral 2: when you do bump your head on a limit of this kind, it’s very handy if those responsible for the software are responsive to user queries. Even better, of course, if they turn out to have fixed the problem before you asked about it.

XML Prague 2011

[17 February 2011]

XML Prague is just over a month away. I’ll be there again this year, the organizers having once more generously invited me to provide some closing remarks at the end of the conference.

I’d urge anyone within easy travel distance of Prague to plan to attend, but it turns out the conference is booked to capacity already. So there’s nothing to do, if you haven’t already registered, but wait ’til next year. (Well, that’s not really true. XML Prague provides live streaming video, which means you may be able to watch some of the talks even if you can’t be in the room.)

The theme this year is Web and XML; the suggested topics include speculation on why XML never became the usual way to prepare Web pages and the relations between XML and HTML 5 and between XML and JSON. The latter two, at least, seem designed to provoke some bottle-throwing; maybe they will succeed.

The papers include one by Alain Couthures on JSON support in XForms and one by Jason Hunter on a JSON facade for Mark Logic Server; this suggests that at least some XML users plan to co-exist with JSON by the expedient of making XML tools present and work with JSON as if it were XML. When one has to deal with information provided by others which is available only in JSON form and not in XML, it will be handy to view the JSON information through XML lenses.

Other papers describe XQuery in the browser (by way of a Javascript implementation from ETH Zurich, created by compiling the Java-based MXQuery engine into Javascript using Google’s Web Toolkit), XSD in the browser (from the University of Edinburgh), and XSLT 2.0 in the browser (from Saxonica), as well as a general consideration of XML processing in the browser (Edinburgh again). Some papers are about XML and information processing outside the browser: one team is translating SPARQL into XQuery, and Uche Ogbuji of Zepheira is presenting the Akara framework under the title “Spicy Bean Fritters and XML Data Services”, which makes me eager to go have some spicy bean fritters (figurative or literal).

There are other papers on EPUB, on electronic Bibles, on XQuery optimization, and on a variety of specific applications, projects, and tools.

It should be fun. If you’ll be there, I look forward to seeing you there; if you won’t be there this year, you might sample the conference using the video feed (some people I know turn off the video as distracting and just listen to the audio, which takes less bandwidth). And if not this year, then perhaps next year.

Why do some XML-oriented users avoid Cocoon 2.2?

[24 January 2011]

Over time, I’ve become aware that I’m not the only user of Cocoon 2.1 who has not yet moved to Cocoon 2.2.

In my case, the basic story is simple. When I first considered installing Cocoon 2.2, I expected installation to be very similar to installation of Cocoon 2.1; it is after all a decimal-point release. So I looked for a zip or tgz file to download and couldn’t find one. A little puzzled, I read the getting-started documentation, which informed me to my dismay that the first thing I had to do was install a particular Java build manager and start building an extension to Cocoon. (I have nothing whatever against Java build managers in general or Maven [the build manager in question] in particular. It’s just that the large majority of lines of code I’ve written in the last ten years are in XSLT, with Prolog and XQuery a distant second and third, with Common Lisp, Emacs Lisp, C, Java, Rexx, and other languages bringing up the rear like the peloton in a bicycle race. So no, I didn’t have Maven installed and didn’t have it on my list of things to do sometime soon or ever.) Now, I like Cocoon a fair bit and moving to 2.2 still seemed like a good idea, so I got as far as downloading Maven and working through its five-minute introduction before I went back to the Cocoon 2.2 intro and learned that the first thing I was to do was develop a Java extension to Cocoon. That was when I lost patience, said “This is nuts” and re-installed Cocoon using an old Cocoon 2.1 .war file.

Some months later I thought about it again and decided I should give it another try. I did. And essentially the same thing happened again.

In the time since then I’ve encountered two or three other XML-oriented people who have told me similar stories as explanations of why they are still using Cocoon 2.1.

Recently I’ve come to believe that the problem here (if it’s a problem — and as a Cocoon 2.1 user, I think it is) is simple: the Cocoon 2.2 documentation is written (I guess) by people who think of themselves in some primary sense as Java programmers, and they have written it (not surprisingly, if in this case perhaps not quite reasonably) for people much like themselves: i.e. people who want to use Cocoon as a framework to write and deploy Java code and/or to extend Cocoon. There is no highly visible documentation for Cocoon 2.2 aimed at people who want to use Cocoon’s out-of-the-box features to create XML-based web sites where all the heavy lifting is handled by XSLT transformations under the control of Cocoon pipelines, and who are more likely to be interested in writing XSLT than Java. Me, I got interested in Cocoon precisely because I could do nice things without writing Java code for it. I am happy to know, and occasionally to be reminded, that I can extend Cocoon if I ever need to by writing Java code; I’ve done that in the past and I expect to do it in the future.
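For readers who haven’t seen this style of use, it looks roughly like the following sitemap fragment (a sketch only: the match pattern and file names are invented for illustration, and a real sitemap also declares its generators, transformers, and serializers). The only thing the site author writes is the stylesheet.

    <!-- Sketch of a Cocoon pipeline: read XML, apply XSLT, serialize HTML.
         Pattern and file names are placeholders, not from a real project. -->
    <map:match pattern="*.html" xmlns:map="http://apache.org/cocoon/sitemap/1.0">
      <map:generate src="content/{1}.xml"/>
      <map:transform src="styles/page-to-html.xsl"/>
      <map:serialize type="html"/>
    </map:match>

All the heavy lifting happens in page-to-html.xsl; there is no Java in sight, which is exactly the appeal.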

I think Cocoon 2.2 could get better uptake among XML-oriented users if there were some highly visible documentation aimed at that demographic. It might also help if there were a document on “Cocoon 2.2 installation for Cocoon 2.1 users” to explain that while Maven is indeed targeted at Java developers, you really don’t need to be a Java developer to find it useful: you can just think of it as a Java-specific package and dependency manager or a much-smarter FTP client specialized for downloading and installing Cocoon in just the way you want it.

More on this topic later.

XForms class 14-15 February 2011, Rockville, Maryland

[5 January 2011; typo corrected 24 Jan 2011]

Black Mesa Technologies has scheduled a two-day hands-on class on the basics of XForms, to be taught 14-15 February 2011 in Rockville, Maryland, in the training facility of Mulberry Technologies (to whom thanks for the hospitality).

The course will cover the XForms processing model, the treatment of simple values and the creation of simple structures, repetitions of the same element, sequences of heterogeneous elements, and techniques for using XForms for complex forms, dynamic interfaces, and multilingual interfaces. It’s based on the one-and-a-half day course given last November at the TEI Members Meeting in Zadar, Croatia, which (judging by the participants’ evaluations) was a success.

XForms has great potential for individuals, projects, and organizations using XML seriously: XForms is based on the model/view/controller idiom, and the model in question is represented by a set of XML documents. That means that you can use XForms to create specialized editing interfaces for XML documents, which exploit the styling and interface capabilities of the host language (typically XHTML) and can also exploit your knowledge of your own data and requirements.

Some people have built more or less general-purpose XML editors for specific vocabularies using XForms. That works, I think, more or less, though in many cases you’ll get better results acquiring a good XML editor and learning to use it. Where XForms really shines is in the creation of ad hoc special-purpose editors for the performance of specialized tasks.

In many projects, XML documents are created and refined in multiple specialized passes. In a historical documentary edition, the document will be transcribed, then proofread and corrected in multiple passes performed by different people, or pairs of people. Another pass over the document will mark places where annotations are needed and note the information needed to write those annotations. (“Who is this person mentioned here? Need a short biographical note.”) And so on.

In a language corpus, an automated process may have attempted to mark sentence boundaries, and a human reviewer may be assigned to correct them; the only things that reviewer is supposed to do are open the document, split the s (sentence) elements where the software missed a boundary, join adjacent s elements wherever the software wrongly thought it had found one, save the document, and quit. If you undertake this task in a full XML editor and you get bored and lose concentration, there is essentially no limit to the amount of damage you could accidentally do to the data. What is needed for situations like this is what Henry Thompson of the University of Edinburgh calls ‘padded-cell editors’ — editors in which you cannot do all that much damage, precisely because they are not full-featured general-purpose editors. Because they allow the user to do only a few things, padded-cell editors can have simpler user interfaces and be easier to learn than general-purpose editors.

The construction of padded-cell editors has always been a complicated and expensive task; it’s going to take thousands, or tens of thousands, of lines of Java or Objective C or Python to build one, even if you have a reasonably good library to use. With XForms, the high-level abstractions and the declarative nature of the specification make it possible to do roughly the same work with much less code: a few hundred lines of XHTML, CSS, and XForms-specific markup.
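To make that claim a little more concrete, here is the skeleton of one such ad hoc editor, for the annotation-notes pass mentioned earlier. It is only a sketch: the instance file notes.xml, its note elements, and their target attributes are invented for the example, and a real editor would add CSS and a few constraints.

    <!-- Sketch of a padded-cell editor in XHTML + XForms 1.1.
         notes.xml and its note/@target structure are hypothetical. -->
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:xf="http://www.w3.org/2002/xforms"
          xmlns:ev="http://www.w3.org/2001/xml-events">
      <head>
        <title>Annotation notes</title>
        <xf:model>
          <xf:instance src="notes.xml"/>
          <xf:submission id="save" resource="notes.xml"
                         method="put" replace="none"/>
        </xf:model>
      </head>
      <body>
        <h1>Places needing annotation</h1>
        <!-- One row per note in the instance; the user can edit the
             text of each note and nothing else in the document. -->
        <xf:repeat ref="note">
          <xf:output ref="@target"><xf:label>Location: </xf:label></xf:output>
          <xf:input ref="."><xf:label>What is needed? </xf:label></xf:input>
        </xf:repeat>
        <xf:trigger>
          <xf:label>Save</xf:label>
          <xf:send submission="save" ev:event="DOMActivate"/>
        </xf:trigger>
      </body>
    </html>

Because the form exposes only the notes, a bored or distracted user cannot mangle the transcription itself, which is the whole point of the padded cell.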

This is why I think XForms has a place in the toolkit of any project or organization making serious use of XML. And, coincidentally, it may be a reason that you, dear reader, or someone you know might want to attend this XForms course.

(Oh, yes, one more thing: we have set up an email announcement list for people who want to receive email notification of this and other courses organized or taught by Black Mesa Technologies; a sign-up page is available.)

What constitutes successful format conversion?

[31 December 2010]

I wrote about the International Digital Curation Conference earlier this month, but did not provide a pointer to my own talk.

My slides are on the Web on this site; they may give some idea of the general thrust of my talk. (On slide 4, “IANAPL” expands to “I am not a preservation librarian”. On slide 20, the quotation is from an anonymous review of my paper.)

Over time, I have become more and more convinced that formal proofs of correctness are important for things we care about. The other day, for example (29 December, to be exact), I saw a front-page article in the New York Times about radiation overdoses resulting from both hardware and software shortcomings in the device used to administer radiotherapy. I found it impossible not to think that formal proofs of correctness could help prevent such errors. (Among other things, formal proofs of correctness force those responsible to say with some precision what correct behavior is for the software in question, which is likely to lead to more explicit consideration of things like error modes than might otherwise happen.)

Formal specification of the meaning of markup languages is only a small part of making formal proofs of system correctness possible. But it’s a step, I think.

International Digital Curation Conference (IDCC) 6, Chicago

[13 December 2010]

Spent the early part of last week in Chicago attending the 6th International Digital Curation Conference, co-sponsored by the U.K.-based Digital Curation Centre and the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana-Champaign, “in partnership with” the Coalition for Networked Information (CNI).

I won’t try to summarize everything here, just mention some talks that caught my attention and stick in my mind.

The opening keynote by Chris Lintott talked about his experiences setting up the Galaxy Zoo, an interactive site that allows users to classify galaxies in the Sloan Sky Map by their form (clockwise spiral, counter-clockwise spiral, etc.). At the outset I was deeply skeptical, but he won me over by his anecdotes about some surprising results achieved by the mass mobilization of these citizen scientists, and by saying that if you want that kind of thing to work, you must treat the users who are helping you as full partners in the project: you must describe accurately what you are doing and how they are helping, and you must share the results with them, as members of the project team. The Galaxy Zoo has been such a success that they are now building an infrastructure for more projects of the same kind (essentially: where humans do better than software at recognizing patterns in the data, and where it’s thus useful to ask humans to do the pattern recognition on large datasets), called the Zooniverse.

Those with projects that might usefully be crowd-sourced should give the Zooniverse a look; it might make it feasible to do things you otherwise could not manage. (I wonder if one could get better training data for natural-language parsers and part-of-speech taggers that way?)

In the immediately following session, Antony Williams of the ChemSpider project described the somewhat less encouraging results of a survey of Web information about chemistry, from the point of view of a professional chemist who cares about accuracy (and in particular cares about stereochemical details). Barend Mons gave an optimistic account of how RDF can be used not just to describe Web pages but to summarize the information they contain, sentence by sentence, and of how the massive triple stores that result can be used to find interesting new facts. It was all very exciting, but his examples made me wonder whether you can really reduce a twenty-five- or forty-word sentence to a triple without any loss of nuance. In the question session, Michael Lesk asked an anodyne question about Doug Lenat and the Cyc project, which made me think he was a little skeptical, too. But the speaker dodged the bullet (or tried to) by drawing firm lines of difference between his approach and Cyc. John Unsworth rounded out the session by describing the MONK (Metadata Offer New Knowledge) project and making the argument that humanities data may resist correction and normalization more violently than scientific data, the idiosyncrasy of the data being part of the point. (As Jocelyn Penny Small of Rutgers once said in an introduction to databases for humanists, “Your job is not to use your computer database to clean up this mess. Your job is to use the computer and the database to preserve the mess in loving detail.”)

Another session that sticks in my mind is one in which Kate Zwaard of the U.S. Government Printing Office spoke about the GPO’s design of FDSys, intended to be a trusted digital repository for (U.S. federal) government documents. In developing their metadata ontology, they worked backwards from the required outcomes to identify the metadata necessary to achieve those outcomes. It reminded me of the old practice of requirements tracing (in which every feature in a software design is either traced to some accepted requirement, or dropped from the design as an unnecessary complication). In the same session, Michael Lesk talked about work he and others have done trying, with mixed success, to use explicit ontologies to help deal with problems of data integration — for example, recognizing all the questions in some database of opinion-survey questions which are relevant to some user query about exercise among the elderly. He didn’t have much of an identifiable thesis, but the way he approached the problems was almost worth the trip to Chicago by itself. I wish I could put my finger on what makes it so interesting to hear about what he’s done, but I’m not sure I can. He chooses interesting underlying questions, he finds ways to translate them into operational terms so you can measure your results, or at least check them empirically, he finds large datasets with realistic complexity to test things on, and he describes the results without special pleading or excuses. It’s just a pleasure to listen to him. The final speaker in the session was Huda Khan of Cornell, talking about the DataStaR project based there, in which they are using semantic web technologies in the service of data curation. Her remarks were also astute and interesting; in particular, she mentioned that they are working with a tool called Gloze, for mapping from XML document instances into RDF; I have got to spend some time running that tool down and learning about it.

My own session was interesting, too, if I say so myself. Aaron Hsu of Indiana gave an energetic account of the difficulties involved in providing emulators for software, with particular attention to the challenges of Windows dynamic link libraries and the special flavor of versioning hell they invite. I talked about the application of markup theory (specifically, of a particular view about the meaning of markup) to problems of preservation — more on that in another post — and Maria Esteva of the University of Texas at Austin talked about visualizations for heterogeneous electronic collections, in particular visualizations to help curators get an overview of the collection and its preservation status, so you know where it’s most urgent to focus your attention.

All in all, a good conference and one I’m glad I attended. Just two nagging questions in the back of my mind, which I record here for the benefit of conference organizers generally.

(1) Would it not be easier to consult the program on the Web if it were available in HTML instead of only in PDF? (Possibly not; possibly I am the only Web user left whose screen is not the same shape as a piece of A4 paper and who sometimes resizes windows to shapes different from the browser default.)

(2) How can a conference on the topic of long-term digital preservation choose to require that papers be submitted in a closed proprietary format (here, .doc files), instead of allowing, preferring, or requiring an open non-proprietary format? What does this say about the prospects for long-term preservation of the conference proceedings? (A voice in my ear whispers “Viable just as long as Microsoft thinks the current .doc format is commercially a good idea”; I think IDCC and other conferences on long-term preservation can and should do better than that.)

Best Practices Exchange 2010

[8 October 2010]

Last week nearly two hundred archivists and librarians gathered for Best Practices Exchange 2010 in Phoenix. BPEx is described by its chair, Richard Pearce-Moses, as an ‘unconference’, and he worked hard to encourage sharing of information not just from presenters to audience but vice versa.

Most participants were from state archives, state libraries, local government (e.g. county clerks’ offices), federal agencies, or large research institutions; as an IT consultant, I was one of a few exceptions. Many people knew each other already from participation in joint projects (past or present), but it was as friendly and welcoming a group for newcomers as I have encountered. I learned a lot about what our public memory institutions are doing about digital preservation (short answer: the situation is fluid, lots of people are aware of the problem, lots of people are working to develop and elaborate solutions, and so far there are no simple answers) and what the situation looks like from the point of view of a state archivist or librarian, charged by law with preserving certain records or publications of government for a fixed term or in perpetuity. (Why do images of tsunamis and avalanches keep coming to my mind?)

I don’t have time for a full trip report right now, but among the highlights of the meeting I have to mention the pre-conference workshop on digital preservation management, taught by Nancy McGovern of ICPSR (the Inter-University Consortium for Political and Social Research) — it’s nice to see the social sciences getting some credit for their decades of experience with digital preservation! Going through what is essentially a three-to-five-day workshop in a single day felt like drinking from a fire hose, but she managed to keep things clear and make the day a pleasure.

And David Ferriero, the Archivist of the United States, gave a keynote talk which was substantive, candid without being indiscreet, and occasionally quite funny. I had had low expectations coming in, so his talk was a very pleasant surprise.

For anyone interested in digital preservation, this is a great gathering. It is to be hoped that someone will volunteer to host another Best Practices Exchange next year; if they do, I will certainly do my best to get there!

XML for the Long Haul

[9 June 2010]

The preliminary program for the one-day symposium on XML for the Long Haul (2 August 2010 in Montréal) is up and on the Web. Actually, it’s been up for a while, but I’ve been very busy lately and have had no time for blogging. (I know it seems implausible, but it’s true.)

The preliminary program has a couple of slots left open for invited talks, which aren’t ready to be announced yet, but even in its current form it looks good (full disclosure: I’m chairing the symposium and made the final decisions on accepting papers): we have two reports from major archives (Portico [Sheila Morrissey and others] and PubMed Central [Jeff Beck]) which face many of the same problems but take somewhat different approaches to addressing them. We have a retrospective report from a group of authors involved in a multi-year German project on the sustainability of linguistic resources [Georg Rehm and others]; the project has wound down now and I am hoping that the authors will be able to give a useful summary of its results.

Josh Lubell of NIST will talk about the long-term preservation of product data; for certain kinds of products (think Los Alamos and Oak Ridge) the longevity of that information is really, really important to get right. And Quinn Dombrowski and Andrew Dombrowski of the University of Chicago will shed an unexpected light on the problem of choosing archival data formats, by applying Montague semantics (a very interesting method of assigning meaning to utterances, usually applied to natural languages and thus a little unexpected in the markup-language context) to the problem.

And the day concludes with Liam Quin of W3C providing a very cogent high-level survey of long-term preservation as an intellectual problem, and drawing out some consequences of the issues raised.

As is usual at Balisage symposia, we have reserved ample time for discussion and for an open session (aka free-for-all) at the end of the day.

As you can see, the program examines the problem area from a variety of perspectives and provides ample opportunity for people who might not often hear from each other to exchange views and learn from experience in other fields.

If you have any interest at all in long-term preservation of information (for whatever value of “long-term” makes sense in your context), you should plan to be in Montréal on 2 August. See you there!

The view from Black Mesa

[2 June 2010]

Black Mesa is a volcanic outcropping just north of San Ildefonso Pueblo in northern New Mexico. (The name “Black Mesa” is used of a bewildering variety of geographic features in the southwest, including two that can be seen from my house. The Black Mesa for which this blog is named is the one on San Ildefonso land right beside the Rio Grande.)

Because Black Mesa lies just outside my office window, two or three miles distant to the south, and because I find its profile beautiful, I spend a lot of time looking at it when I’m thinking about things and trying to get my ideas clear.

It’s clear that information technology has turned many things upside down for libraries, archives, museums, and others interested in preserving and providing access to information. It seems to me, though, that it hasn’t turned everything upside down: many properties we associate with libraries and archives and museums reflect not the properties of pre-digital technology but the characteristics of the problems they are trying to solve. Which existing practices should be changed to exploit digital technology? Which should remain in place?

What is the right way to use digital technology in preserving, protecting, and providing access to non-commercial information, public information, literary texts, linguistic resources, and (for want of a less grandiloquent phrase) the cultural heritage of humanity?

Getting my ideas clear on those and related questions is what this blog is for: it is intended to provide a place to record some of those ideas. I hope you enjoy it.