Daring to Do Less with XML

May 2, 2001

"But the lesson the Web teaches, reinforced by XML, is that the way forward lies in Daring To Do Less." - Tim Bray, on the XML-DEV mailing list

Many observers have noted that the basic simplicity of XML is a fundamental reason for its rapid acceptance in electronic business. The original XML specification can be printed on about 40 pages, compared to the more than 400-page SGML specification from which it is derived. This makes it relatively easy to implement an XML parser that processes input text and makes the XML structures encoded in that text available to ordinary software. On the other hand, XML has a number of quite notable limitations in its original form. For example, the DTD schema language in the XML Recommendation is too limited for many business purposes because it has little conception of datatypes that are used in almost all programming languages and e-business applications.

Perhaps the most visible approach to this problem, and the focus of the World Wide Web Consortium that has defined XML and many related standards, is to build new XML-related specifications that address the limitations. These efforts include the recent or emerging specifications for XML Namespaces, RDF, XSL, and XML Schemas. They have brought additional functionality to the basic XML toolset, but they have also drawn widespread criticism for the difficult prose in which they are written and for the complexity of the underlying structures and operations that they define. Popular writers on XML topics have publicly called for a refactoring of the XML specifications ("acknowledging that it is hard to get things right the first time, and allowing changes in requirements"), as well as for a minimization of their growing interdependency. We've even seen a number of rudely named Web sites maintained by well-known XML developers, put up to provide public forums in which to discuss topics such as "XML Namespaces: Godsend or Demon Seed." Needless to say, these sites have not generated much public approval from the XML community, but it is important to emphasize that they do not oppose XML; rather, they advocate "tough love" for it.

In short, the XML community faces something of a dilemma. It's the simplicity of the XML specification itself that has brought it such widespread acceptance in such a short time, but the accompanying lack of features leads to increased complexity of the XML family of specifications. This in turn leads to the backlash that we are now observing.

What can we, as developers and users of XML and XML-enabled products, do to effectively dodge both horns of this dilemma? I believe that the following rules of thumb can help you exploit the true power of XML while avoiding the tar pits.

Use the minimal subset of the XML specifications needed to do the job at hand.
If you are implementing a subset, call it a subset.
Be prepared to use simpler equivalents for the W3C specifications.
Vote with your feet.

Let's examine these suggestions in more detail and take a look at various actual specifications and bits of software that can help you follow them.

Wield Occam's Razor

For a number of reasons, the various XML specifications are not as consistent as one might hope. For example, there are subtle but annoying differences among XML syntax, the W3C DOM data model, and the XPath/XSLT data model with respect to their representation of the structure and content of an XML document. There are also a significant number of optional features in the XML specification itself, as well as a separate XML Namespaces specification. Thus, it is quite possible for a tool that correctly implements one set of specifications to interoperate poorly with another that implements a slightly different set, even though both are correctly programmed. Worse, the interaction between some XML specs (such as XML Namespaces and XML 1.0 DTDs) is simply undefined.
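One such difference can be seen in a few lines of Python, offered here as a rough sketch and not as anything taken from the specifications themselves. The standard-library minidom module follows the W3C DOM and keeps a CDATA section as a node of its own, whereas in the XPath 1.0 view CDATA is ordinary character data and adjacent text is merged. ElementTree is used below only as a convenient stand-in for that merged view; the helper names and the sample document are illustrative assumptions.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

SAMPLE = "<a>one<![CDATA[two]]></a>"  # illustrative document, not from any spec


def dom_child_types(text):
    """Return simple labels for the DOM node types of the root's children."""
    labels = {
        xml.dom.Node.TEXT_NODE: "text",
        xml.dom.Node.CDATA_SECTION_NODE: "cdata",
        xml.dom.Node.ELEMENT_NODE: "element",
    }
    root = xml.dom.minidom.parseString(text).documentElement
    return [labels.get(child.nodeType, "other") for child in root.childNodes]


def merged_text(text):
    """Return the root's character data as ElementTree reports it (one string)."""
    return ET.fromstring(text).text


if __name__ == "__main__":
    print(dom_child_types(SAMPLE))  # DOM view: separate text and CDATA nodes
    print(merged_text(SAMPLE))      # merged view: a single run of characters
```

Code that walks the DOM children and code that asks for the element's text therefore see the same document differently, which is the kind of mismatch that makes tools built on different models awkward to combine.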

These interoperability problems occur in practice, not just in theory. For example, element names ("tags") containing a colon, such as "htm:h1", are legal (albeit discouraged) in XML 1.0, but a namespace-aware parser will reject such a use of a namespace prefix without a namespace declaration. Moreover, "non-namespace well-formed" XML instances cannot be represented in the XML InfoSet, the basis for several forthcoming W3C specifications including XML Schema, XPath 2.0, and XSLT 2.0. Non-validating parsers may or may not expand external entity references that they encounter. The differences between the DOM and XPath data models have made it very difficult for the W3C DOM working group to add a standard interface for XPath (perhaps along the lines of Microsoft's selectNodes extension), even though there is a widespread desire to meet this obvious need.
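The colon example is easy to reproduce. The following is a minimal sketch using Python's standard library: the expat parser created without a namespace separator treats the document as plain XML 1.0, while ElementTree turns on namespace processing. The function names and the sample documents are assumptions made for illustration.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET

UNDECLARED = "<htm:h1>Hello</htm:h1>"
DECLARED = '<htm:h1 xmlns:htm="urn:example:htm">Hello</htm:h1>'  # made-up URI


def accepted_as_plain_xml(text):
    """Parse with namespace processing off (no namespace_separator given)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


def accepted_with_namespaces(text):
    """Parse with ElementTree, which enables expat's namespace processing."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for doc in (UNDECLARED, DECLARED):
        print(doc, accepted_as_plain_xml(doc), accepted_with_namespaces(doc))
```

The undeclared prefix passes the first check and fails the second, while the declared version passes both: the same bytes are well-formed to one correctly written tool and an error to another.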

These inconsistencies pose a genuine dilemma for standards writers and XML implementers, but there is an effective way for consumers of XML to avoid these complexities: Use the tried and true core of XML that is common to all of the XML-related specifications. Usage guidelines for this Common XML Core subset emerged from a long discussion on the XML-DEV and SML-DEV mailing lists and are currently archived on Simon St. Laurent's web site. The Common XML Core refers to the "frequently used and thoroughly reliable subset of the features provided by the XML 1.0 and Namespaces in XML W3C Recommendations," and the pros and cons of using various additional features of XML are outlined as well.

Seriously minimalistic XML users might want to consider an even smaller subset -- Minimal XML. This began almost as a quasi-academic exercise to determine the bare minimum of XML syntax that one can imagine to be useful for data-oriented applications, rather than as a serious usage guideline. It only includes elements and text and excludes attributes, namespaces, mixed text content, DTDs, CDATA sections, entities, notations, comments, processing instructions, and even the XML declaration itself. Even so, Minimal XML has inspired a number of interesting efforts. For example, some have noted that it bridges the gap between well-understood Lisp and Prolog constructs and XML syntax:

"40+ years of symbolic programming have shown the power and simplicity of the Lisp S-Expression, or almost equivalently, the Prolog term tree. ... The only good thing about XML, as far as I can tell, is what it shares with S-Expressions. Unfortunately, it's about 100 times more complicated. (I don't think I'm exaggerating.) Enter Minimal-XML. It throws out of XML essentially everything except the good stuff -- the S-Expression-nature." - Mark Miller, on the SML-DEV mailing list

There are also practical applications: a useful parser for this subset of XML can be written in less than 50 lines of JavaScript, and a Java version can be small enough to fit in even the smallest devices. Furthermore, no XML expertise is needed to work with Minimal XML; the grammar has only 9 BNF productions, and I have seen a complete XML novice implement a parser in C in less than a week. (The same person went on to implement a validating XML + namespaces parser in about a year.) Specific implementations of minimal XML-like languages in Java include Min by Don Park, another by Shawn Silverman, the Twilight Minds Markup Language or TMML, and one for the TINY platform that implements a larger subset of XML.
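To give a feel for how little code an elements-and-text-only parser needs, here is a rough sketch in Python. It follows only the description given in this article, not the actual Minimal XML grammar: the restricted ASCII name syntax, the treatment of whitespace, the absence of any escaping, and the names parse_minimal and to_sexpr are all assumptions of the sketch. The second function prints the result in S-Expression style, echoing the comparison quoted above.

```python
import re

# A token is a start tag, an end tag, or a run of text. Element names are
# limited to a simple ASCII pattern (an assumption of this sketch).
TOKEN = re.compile(r"<(/?)([A-Za-z_][A-Za-z0-9._-]*)>|([^<]+)")


def parse_minimal(text):
    """Parse elements-and-text-only markup into nested (name, value) tuples.

    value is a string for a text-only element, or a list of child tuples.
    """
    pos = 0
    stack = [("#root", [])]  # each frame is (element name, collected items)
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:  # attributes, comments, PIs, etc. all land here
            raise ValueError(f"unsupported markup at offset {pos}")
        pos = match.end()
        closing, name, chars = match.groups()
        if chars is not None:
            if chars.strip():  # drop whitespace-only runs between elements
                stack[-1][1].append(chars)
        elif not closing:
            stack.append((name, []))
        else:
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError(f"mismatched end tag </{name}>")
            _, items = stack.pop()
            has_text = any(isinstance(item, str) for item in items)
            has_elements = any(isinstance(item, tuple) for item in items)
            if has_text and has_elements:
                raise ValueError(f"mixed content in <{name}>")
            value = "".join(items) if not has_elements else items
            stack[-1][1].append((name, value))
    if len(stack) != 1:
        raise ValueError(f"unclosed element <{stack[-1][0]}>")
    items = stack[0][1]
    if len(items) != 1 or not isinstance(items[0], tuple):
        raise ValueError("expected exactly one root element")
    return items[0]


def to_sexpr(node):
    """Render a parsed node in S-Expression style."""
    name, value = node
    if isinstance(value, str):
        return f'({name} "{value}")'
    return "(" + " ".join([name] + [to_sexpr(child) for child in value]) + ")"


if __name__ == "__main__":
    doc = "<order><id>42</id><item><name>widget</name></item></order>"
    print(to_sexpr(parse_minimal(doc)))
```

Anything outside the subset (an attribute, a comment, mixed content) is rejected with a ValueError instead of being silently accepted, in keeping with the advice to be explicit about what a subset does and does not support.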

Actually, most people who experiment with Minimal XML quickly find that it is easy to implement a larger subset of XML; attributes, comments, processing instructions, and DTD parsing pose few challenges; mixed content is not hard to parse but does complicate the underlying data model quite significantly. But this does not change the basic message of the XML simplification experiments. The true power of XML is the simple elegance at its center; those who stick to that simple core can achieve great results with minimal effort and still maintain compatibility with ordinary XML tools.
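The point about mixed content can be illustrated with Python's ElementTree, chosen here only as a convenient example of a data model that has to cope with it. Text that follows a child element cannot live in the parent's single text field, so ElementTree attaches it to the child as a "tail". The helper name and the sample paragraph are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def content_pieces(text):
    """Show where ElementTree stores each piece of an element's content."""
    root = ET.fromstring(text)
    return {
        "text": root.text,  # characters before the first child
        "children": [(child.tag, child.text, child.tail) for child in root],
    }


if __name__ == "__main__":
    # Data-oriented content: every element holds either text or elements.
    print(content_pieces("<p><b>big</b></p>"))
    # Mixed content: text and elements interleaved inside <p>.
    print(content_pieces("<p>Hello <b>big</b> world</p>"))
```

In the data-oriented case the text and tail slots are simply empty; in the mixed case an application has to stitch three separate fields back together to recover the sentence, which is the extra modeling burden that Minimal XML avoids.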

Keep it Simple, but Call a Subset a "Subset"

Perhaps this does not need to be said, but it is a rare mailing list discussion of XML simplification that does not generate some heated exchanges about standards compliance. Let me be perfectly clear: If you claim to support some XML specification in your product, that means you have to support the whole thing, even those features that complicate your life and don't make sense for your customers. You may not have any use for CDATA Marked Sections or Processing Instructions, but they are part of XML 1.0, so if you say you are conformant with XML 1.0, then you had better support them. You may -- for some good business or technology reason -- choose to develop a product that supports some set of XML-like features other than those defined in the XML Recommendation, and put whatever marketing spin on this that you see fit, but please don't call it XML 1.0.

Interestingly enough, there is a procedure defined by ISO 8879 (SGML) Annex K that adds to the SGML declaration an optional keyword SEEALSO, which provides the URL of an additional requirements document in which any constraints above and beyond SGML can be noted. XML itself is defined using this procedure in Annex L. Similarly, as Rick Jelliffe has pointed out, this allows one to define Common XML, Minimal XML, or anything else that defines a legal subset of SGML as this kind of profile. So, any community that has a business need to agree on some subset of XML 1.0 (namespaces are presumably problematic here, since SGML does not have such a concept) can define it as a standard. That seems far preferable to implementing a subset and hiding that fact in the release notes.

Communities that need to work together to develop a shared XML-related specification can also do so under the umbrella of OASIS, which has recently established a process by which groups of members sharing some common vision can form Technical Committees to draft Committee Specifications. Those which turn out to be widely useful can eventually become OASIS Standards by a vote of OASIS members (with strict rules allowing a relatively small number of members to veto a Standard).

Simpler Alternative Specifications

An OASIS TC has already formed to address one of the thorniest areas of the XML standards landscape, an XML Schema specification. The TREX project aims to develop a schema language that is "simple, easy to learn, uses XML syntax, does not change the information set of an XML document, supports XML namespaces, treats attributes uniformly with elements so far as possible, has unrestricted support for unordered content, has unrestricted support for mixed content, has a solid theoretical basis, and can partner with a separate datatyping language."

There are often simpler alternatives to the complete W3C specifications in other areas. The most widely used is certainly SAX, a lightweight API for XML processing. SAX was devised by an ad hoc group on the XML-DEV mailing list in 1998. Other lightweight XML-related specs do come from standards organizations. For example, the ISO is working to approve a much simpler XML schema specification of its own called RELAX. This will be suitable for many who do not need all the power of the W3C schema language, or may serve as a stepping stone on the way to that destination for others.
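For readers who have not used SAX, its event-driven style can be sketched with the SAX binding in Python's standard library. This is only an illustration of the approach: the handler class, the function name, and the sample input are hypothetical, and the original SAX was defined for Java.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Count start-tag events by element name as the parser streams input."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per start tag; no document tree is ever built.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_bytes):
    """Parse a document held in bytes and return {element name: count}."""
    handler = ElementCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


if __name__ == "__main__":
    print(count_elements(b"<order><item/><item/></order>"))
```

The application implements only the callbacks it cares about and keeps only the state it needs, which is why SAX-style processing stays small even for very large documents.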

There are a number of efforts to define simple APIs for XML, such as Electric XML. (There is also the JDOM project in the Java Community Process, but this is an attempt to build an interface that is simpler for Java programmers to use than the language-neutral W3C DOM and not a minimalist effort.)

XSLT is another W3C specification often criticized for its complexity. I'm not aware of any efforts to do for transformations what RELAX or TREX does for schemas (perhaps because XSLT came out quickly, is widely supported today, and provides considerable benefit in return for its complexity), but there are several small XML scripting languages that at least claim to perform simple transformations more easily than is possible with XSLT. XMLScript is a simple programming language written in XML syntax that is designed to be used in data transformation applications. Matt Sergeant's XPathScript is a stylesheet language with only a few features, but "with the power and flexibility of Perl, XPathScript is a very capable system." Paul Tchistopolskii's XSLScript, on the other hand, is a "terse notation for writing complex XSLT stylesheets" that is "compiled" into real XSLT.

Vote With Your Feet

There can be little doubt that the world would be a better place if one organization, perhaps the W3C, offered a complete suite of well-defined, cleanly integrated XML specifications with layered functionality, allowing users to select their own optimal features-complexity tradeoff. That is surely a world that many of us hope to live in someday, but it is definitely not today's world. I believe that achieving this vision is a task too important to be delegated to the expert committees at the standards bodies and will take active participation by the consumers of XML-related specifications and products. Collectively, all of us are the "invisible hand" of the marketplace that selects the winners and losers, and we can promote evolution in a desirable direction by using the specs and software that do what we need to do, and ignoring those which are overkill.

If this sounds naive, consider the parable of the two watchmakers in Herbert Simon's classic essay "The Architecture of Complexity" in The Sciences of the Artificial. Jessica Lipnack and Jeffrey Stamps have very nicely summarized his argument:

"Suppose each watch consists of 1000 pieces. The first watchmaker constructs the watch as one operation assembling a thousand parts in a thousand steps. The second watchmaker builds intermediate parts, first 100 modules of 10 parts each, then 10 subassemblies of 10 modules each.

It would seem that constructing a watch in a single sequential process would progress faster and produce more watches. Alas, life being what it is, we can expect some interruptions. Stopping to deal with some environmental disturbance, like a customer, the watchmaker puts down the pieces of an unfinished assembly. Each time the first watchmaker puts down the single assembly of 1000, it falls apart and must be started anew, losing up to 999 steps. Interrupting the second watchmaker working on a module of 10 using hierarchical (in the first sense) construction means a loss of at most 9 steps....

Using an elegant mathematical demonstration, Simon shows how dramatically more successful the modular-levels principle is in producing stable and flexible complexity. Nature, he says, must use this principle. And, indeed, systems scientists have extensively documented this level pattern of organization, whether physical (such as particle, atom, and molecule), biological (like the example of cell, organ, and body), social (for example, local, regional, and national government), or technological (one example is phones, local exchanges, and long-distance networks)."
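The flavor of that demonstration can be captured in a back-of-the-envelope calculation. The sketch below is not Simon's own derivation: it assumes a simple model in which every attempt to add a piece is interrupted with probability p, an interruption forces the current assembly to be restarted, and finished assemblies are stable. The value p = 0.01, the function names, and the reading of the modular watch as 111 assemblies of 10 pieces each (100 modules, 10 subassemblies, and the final watch) are assumptions made for illustration.

```python
def expected_steps(pieces, p):
    """Expected number of attempted additions (wasted ones included) needed
    to finish one assembly of `pieces` pieces without an interruption.

    This is the standard result for the expected number of trials until
    `pieces` consecutive successes, each succeeding with probability 1 - p.
    """
    q = 1.0 - p
    return (q ** -pieces - 1.0) / p


def compare_watchmakers(p=0.01):
    """Return (monolithic, modular) expected step counts per finished watch."""
    monolithic = expected_steps(1000, p)
    # 100 modules + 10 subassemblies + 1 final assembly, 10 pieces each.
    modular = 111 * expected_steps(10, p)
    return monolithic, modular


if __name__ == "__main__":
    monolithic, modular = compare_watchmakers()
    print(f"monolithic: {monolithic:,.0f} steps")
    print(f"modular:    {modular:,.0f} steps")
    print(f"ratio:      {monolithic / modular:,.0f}x")
```

Under these assumptions the modular watchmaker needs only slightly more than the 1,110 steps an uninterrupted build would take, while the monolithic one needs roughly two thousand times as many. The exact figure depends heavily on p, but the gap stays enormous for any realistic rate of interruption.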

The moral of the story for XML seems clear. In the long run, evolutionary pressures will favor markup language specifications built from simple, modular components over monolithic "one size fits all" specs issued by some standards authority. In the short run, the consumers of XML provide the environmental disturbances that determine whether the specmakers are given time to build monolithic but fragile specifications, or if they will be forced to build "stable and flexible complexity" now rather than later.

The most important thing to remember is that nothing is forcing you to use and support XML specifications that don't meet your real data processing needs. Use the XML specifications and tools that work for you; don't feel compelled to use the spec du jour just because <JustKidding> all the really cool people are talking about the hermeneutics of their RDF ontologies, or how to pipeline their post schema validation information sets</JustKidding>. Support the organizations, specifications, and vendors which define and implement simple, modular XML components that can survive evolutionary change.