Schemas: Different Strokes From Different Folks

January 15, 2017

relax-ng schematron xml-schema

In defense of multiple schema languages for XML, but Schematron in particular.

Schema languages are the bedrock of XML software engineering, but why are they so different? Because their creators approached them with very different viewpoints about how software engineering should work. The Engineering part of Software Engineering is relatively easy to define: something like the disciplined and creative application of scientific and technical knowledge to produce predictable systems. But what people mean by the Software part in practice can vary widely, and I think it allows an insight into why there are (and why there need to be) multiple schema languages for XML. I had front row seats for the development of XML Schemas, Schematron and RELAX NG, and I think each sprang out of quite different views about software.

There's Nothing as Practical as a Good Theory

There is an old Computer Science dictum: data structures + algorithms = programs. You work out which algorithm is optimal given the data structures and constraints you have, and then you implement it. Along the way you may add some sweeteners and take advantage of low-hanging fruit, but you don't want to compromise the elegance of the solution, out of fear of losing performance and predictability: you are diligent at resisting scope-creep. The question you ask is "What is the sweet spot between power, performance and implementability?"

This is the approach taken with RELAX NG, a schema language, now an ISO Standard, created by Makoto Murata and James Clark. (The data structure is the XML document, considered as a kind of tree called a "hedge": a hedge is a sequence of elements possibly interleaved with character data. The algorithm is an implementation of a regular hedge grammar using derivatives.) With RELAX NG, some quite complex constraints can be expressed tersely and clearly. There is some forced sweetening, in the allowance of XML Schemas-like data types. And there is some low-hanging fruit: the extension of the content model to allow attributes as well as elements and text.
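
To give a feel for what "using derivatives" means, here is a minimal sketch in Python. It is not the RELAX NG algorithm itself: it ignores attributes, text, data types, name classes and nested content, and only checks a flat sequence of child element names against a content model. The class and function names (Element, Group, Interleave, deriv, matches and so on) are my own illustrative choices, not names taken from any specification or library.

```python
from dataclasses import dataclass


class Pattern:
    """Base class for the toy content-model patterns below."""


@dataclass(frozen=True)
class Empty(Pattern):        # matches only the empty sequence
    pass


@dataclass(frozen=True)
class NotAllowed(Pattern):   # matches nothing at all
    pass


@dataclass(frozen=True)
class Element(Pattern):      # matches exactly one element with this name
    name: str


@dataclass(frozen=True)
class Choice(Pattern):       # left or right
    left: Pattern
    right: Pattern


@dataclass(frozen=True)
class Group(Pattern):        # left followed by right
    left: Pattern
    right: Pattern


@dataclass(frozen=True)
class Interleave(Pattern):   # left and right, in any interleaving
    left: Pattern
    right: Pattern


@dataclass(frozen=True)
class ZeroOrMore(Pattern):   # inner repeated zero or more times
    inner: Pattern


def nullable(p):
    """True if the pattern can match the empty sequence."""
    if isinstance(p, (Empty, ZeroOrMore)):
        return True
    if isinstance(p, Choice):
        return nullable(p.left) or nullable(p.right)
    if isinstance(p, (Group, Interleave)):
        return nullable(p.left) and nullable(p.right)
    return False  # Element and NotAllowed


def deriv(p, name):
    """The pattern that remains after consuming one element called name."""
    if isinstance(p, Element):
        return Empty() if p.name == name else NotAllowed()
    if isinstance(p, Choice):
        return Choice(deriv(p.left, name), deriv(p.right, name))
    if isinstance(p, Group):
        first = Group(deriv(p.left, name), p.right)
        if nullable(p.left):
            return Choice(first, deriv(p.right, name))
        return first
    if isinstance(p, Interleave):
        return Choice(Interleave(deriv(p.left, name), p.right),
                      Interleave(p.left, deriv(p.right, name)))
    if isinstance(p, ZeroOrMore):
        return Group(deriv(p.inner, name), p)
    return NotAllowed()  # Empty and NotAllowed consume nothing


def matches(p, names):
    """Validate a sequence of element names by taking derivatives in turn."""
    for n in names:
        p = deriv(p, n)
    return nullable(p)


# A hypothetical content model: a title, then an author interleaved with any notes.
book = Group(Element("title"),
             Interleave(Element("author"), ZeroOrMore(Element("note"))))
```

With this model, matches(book, ["title", "note", "author", "note"]) succeeds while matches(book, ["author", "title"]) fails. The appeal of the approach shows even in the toy: every alternative is carried along inside the derivative, and the decision is only made by the final nullable check.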

Just Fit In

The second approach would be to say that successful software is like a jigsaw puzzle: you want to develop one piece that fits into the existing infrastructure. You can see how this approach appeals both to standards-makers (people who make standards want to use standards, Kool-Aid fashion) and to vendor corporations (who want us to be locked into their technical eco-system as much as possible). The ideas of elegance or power are not the point, and are indeed somewhat mysterious: the focus is on how to fit in and not re-invent any wheels. The question you ask is "How can we go from what we have to where we need to go?"

This is the approach taken with W3C XML Schemas, a schema language created by a group at the World Wide Web Consortium with substantial involvement by members representing database companies (IBM/Lotus, Oracle, Microsoft). At one meeting I found myself sitting next to Peter Chen, of Entity-Relationship diagram fame. XML Schemas starts with a grammar system no more powerful than DTDs, then grafts on things to make it fit the legacy technical infrastructure better. This pays off in XML Schemas' usefulness for data-binding applications, where you want to use the schema directly to read XML into or write XML out from some object or database representation. XML Schemas is a schema language that helps make XML look more like what people who have not used XML might expect: object-like inheritance here (type derivation), database-like nulls there, forms-like data field-validation.

Some innovations, like the all content model (all elements in any order), were only allowed after it was shown that they did not require a different class of grammars and implementations (though in fact the all content model also had an obvious and efficient special implementation). It was accepted, I think, because it might handle database fields more simply, without an explosion in the schema document. So the focus was not elegance or power or easy implementability or comprehensibility but practical integration.

Breathe In, Breathe Out

A third approach looks through the lens of the Software Development Life Cycle: software development is ultimately the process of going from informal human expression of ideas to executable code and then, like a breath out after the breath in, communicating back information about that executed code as human-comprehensible results. Software modularization should break up tasks along the same lines, into smaller human-computer-human cycles.

So what a schema language for XML needs to do is provide a way of capturing human information about the patterns and lifecycle of a document, execute it on a computer, and give a result in terms that are meaningful to a human. The question you ask is "What pattern are you looking for, how do you calculate that, and how can you best inform the humans?" And voilà Schematron...

Schematron provides assertions to capture the requirements in natural language (and rich text); these lists of requirements have a full complement of markup: IDs, icons, labels and links to supporting material. The schema and the main elements have titles and paragraphs. You can model and describe variations such as dialects and subsets using the phases mechanism. Forcing the user to formulate the requirement, even though just in natural language, is a big step away from fuzziness. A good template for an assertion is "An X should have a Y because of Z".

Schematron uses XPath (typically!) to provide the executable code equivalents of the natural language. There is good access to the web: data can be pulled in from any POX (Plain Old XML) web service using HTTP GET. Paths can be simplified and execution improved with variables. The result of validation can be an XML document (in the standard SVRL format) that can be further processed by subsequent XML systems. Arbitrary information and labels can be calculated and linked to the original document using the new property mechanism.

And finally Schematron allows custom messages to be constructed and reported back to the user, using the diagnostics mechanism. There are no built-in validation messages in Schematron: the developer of the schema is completely free to express their diagnostic messages back to the users in terms that can best explain what patterns were found (or were missed) in the document.
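
The human-computer-human cycle can be sketched in a few lines of Python. This is only an analogy and not Schematron: the Assertion class, its field names and the validate function are hypothetical, and the path expressions use the limited XPath subset built into Python's xml.etree.ElementTree rather than full XPath. What it tries to show is the pairing of a natural-language requirement with an executable test, and a result that is written by the schema author rather than by the tool.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable


@dataclass
class Assertion:
    context: str                             # path selecting the nodes to check
    test: str                                # path that must match under each context node
    text: str                                # the requirement, in natural language
    diagnostic: Callable[[ET.Element], str]  # builds a message for the human reader


def validate(root, assertions):
    """Return one report entry for every context node that fails its test."""
    report = []
    for a in assertions:
        for node in root.findall(a.context):
            if not node.findall(a.test):
                report.append({"assertion": a.text,
                               "diagnostic": a.diagnostic(node)})
    return report


# A made-up document: the second book lacks a title.
doc = ET.fromstring(
    "<library>"
    "<book id='b1'><title>Emma</title></book>"
    "<book id='b2'/>"
    "</library>"
)

rules = [
    Assertion(
        context=".//book",
        test="title",
        text="A book should have a title because the catalogue is sorted by title.",
        diagnostic=lambda n: f"Book {n.get('id')} has no title element.",
    )
]

report = validate(doc, rules)
```

Here the report holds a single entry, for the book with id b2, carrying both the general requirement and the specific diagnostic. Nothing in the output is a message invented by the validator.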

Only Schematron provides any support at all for the human aspects: not only that natural language is useful, and that the user may need an explanation of the problem in terms they understand, but also that human activity is dispersed geographically or organizationally (hence the need for access to information by URL) and sliced temporally as workflows, variants and lifecycles (hence the need for phases and parameters). The other schema languages belong to a small world where there is no larger information system, no time, and no humans: they provide solutions (elegant and elephantine) to a too-small problem.

When simple XPath assertions were being added to XML Schemas 1.1, I was in communication with Working Group members (whom I really respect: my problems with XML Schemas spring from this "small world" assumption that I find ludicrous, not from the people or personalities). I told them that there was no need to tie their assertions back to Schematron in some way, such as by just using the Schematron namespace and element names, because without the primacy of the natural language assertion they are fundamentally different things.

Closed Under Union?

As an example of the three approaches, let's consider the property a schema language could have called "closure under union", which is whether a single schema can handle two different schemas: let's say two dialects of basically the same vocabulary of elements in the same schema.

According to James Clark, "for any two RELAX NG schemas, there is a RELAX NG schema for its union". Dr Murata's early research concerned set operations on grammars, in fact. RELAX NG tolerates ambiguity really well during the course of validation, only deciding at the end whether the document is valid, and so can support dialects well. Elegant and powerful.

In contrast, XML Schemas is certainly not closed under union, because the idea of supporting multiple variations or dialects was completely foreign to the group designing it: you would use some other layer, or tools, or derive multiple schemas.

But in Schematron, where we ask "Why do they want this?" rather than "What is the mathematical property?", there is partial support for doing what closure under union allows. If you use the phase mechanism you can certainly combine two different schemas, but the language does not take care of the disambiguation: you have to tell Schematron which phase should be active, having declared which patterns are common to every dialect and which patterns belong only to a single dialect. In Schematron, the important thing is to explicitly represent that you have these different variants, not to ignore them or necessarily handle them automatically.
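
The Schematron side of this can be sketched as follows. Again this is an analogy in Python and not Schematron syntax: the PATTERNS and PHASES tables, the dialect names and the validate function are all made up for illustration. The point is only that the two dialects live in one schema, share some patterns, and that the caller must name the active phase explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical patterns: each is a list of (required child element, message).
PATTERNS = {
    "common": [("title", "A document should have a title.")],
    "dialect-a": [("abstract", "A dialect A document should have an abstract.")],
    "dialect-b": [("summary", "A dialect B document should have a summary.")],
}

# Hypothetical phases: each names the patterns that are active for one dialect.
PHASES = {
    "A": ["common", "dialect-a"],
    "B": ["common", "dialect-b"],
}


def validate(root, phase):
    """Check the root element against the patterns of the chosen phase only."""
    failures = []
    for pattern_name in PHASES[phase]:
        for required_child, message in PATTERNS[pattern_name]:
            if root.find(required_child) is None:
                failures.append(message)
    return failures


# A made-up document written in dialect B.
doc = ET.fromstring("<doc><title>T</title><summary>S</summary></doc>")

failures_as_a = validate(doc, "A")  # checked as dialect A
failures_as_b = validate(doc, "B")  # checked as dialect B
```

Validated under phase B the document passes; under phase A it fails with the message about the missing abstract. Nothing here works out which dialect the document is in: that knowledge is supplied by whoever runs the validation, which is exactly the trade-off described above.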

[Article's final paragraph clarified in response to discussion on XML-DEV mail list.]