XProc 3.0 - Strategies for merging documents
November 16, 2020
Introduction
XProc is an XML-based programming language for processing documents in pipelines: chaining conversions and other steps together to achieve the desired results. For an introduction to XProc, the xml.com site has an introductory article and an article about connecting ports. You can also visit the XProc 3.0 site itself for more information. And there's even a whole book dedicated to the language!
This article explores a common problem when handling XML documents: merging several documents into one, using an XSLT stylesheet to bring the data together. It's a bit more advanced than the two articles mentioned above and assumes you already know some XProc 3.0 (and a little XSLT).
The example code is freely available to play with. To run the examples you'll need an XProc processor. An overview of the available processors can be found here.
The example case
To illustrate the different merging strategies, I created an example that contains both data and several working pipelines. You can find it on GitHub: https://github.com/xatapult/xproc-merging-example.
We're going to merge three different XML documents. The merging itself is done by an XSLT stylesheet, which raises the question: how do we make all these documents available to the XSLT processor in an XProc 3.0 pipeline?
Now if it were as simple as depicted in Figure 1, life would be rather easy. You designate one of the documents as the main input to the stylesheet and pull in the other two using the XPath doc() function.
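As a minimal sketch of this easy case, assume the template is the main input, its elements are in no namespace, and the other two documents live in hypothetical files data/temperatures.xml and data/city-ids.xml next to the stylesheet. The stylesheet only shows that both documents are reachable; it does not produce the final page:

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy the main input (the template) unless a template rule says otherwise -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Hypothetical file names; relative URIs resolve against the stylesheet's location -->
  <xsl:variable name="temperatures" select="doc('data/temperatures.xml')"/>
  <xsl:variable name="city-ids" select="doc('data/city-ids.xml')"/>

  <xsl:template match="body">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Placeholder output: counts the child elements of both root elements -->
      <p>Temperature entries: <xsl:value-of select="count($temperatures/*/*)"/></p>
      <p>Known cities: <xsl:value-of select="count($city-ids/*/*)"/></p>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```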
But what if the documents are not directly available on disk, for instance because you have to create, transform or validate them first? Assume that the source for one of the inputs is in Markdown and you need it in XML or HTML. Or you want to do XInclude processing up front because the document contains references. Pre-processing documents before use is not uncommon, so let's see how we can do this using XProc 3.0, solving a simple, somewhat artificial, but nonetheless illustrative problem.
The three documents we want to merge are:
An HTML template, straightforward, no pre-processing required:
Our process is going to fill the body tag.
A file with temperatures, linked to a city. An entry for a single city looks like this:
All these entries come from different sources, each stored in a document of its own. They're combined in a master file that XIncludes everything:
A document that links city identifiers to names:
Now assume for this use case we don't really trust the production of this document. Therefore we want to make sure it's valid before we process it. The schema for this is in data/xsd/city-ids.xsd.
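To make the temperatures master file described above a bit more concrete, here is a hedged sketch of what such an XInclude-based document could look like. Only the root element name city-temperatures is taken from this article; the included file names are invented for illustration:

```xml
<!-- Hypothetical master file: every xi:include pulls in the entries of one source -->
<city-temperatures xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- File names are assumptions; each one holds the temperatures for a single city -->
  <xi:include href="temperatures-source-1.xml"/>
  <xi:include href="temperatures-source-2.xml"/>
</city-temperatures>
```

After XInclude processing, each xi:include element is replaced by the root element of the document it refers to.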
All together, the full flow we want to implement is:
The result will be a boring HTML page, listing cities and temperatures. For didactic purposes, this is kept deliberately simple. However, let nothing stand in your imagination's way to invent (and implement) more complicated scenarios…
Using disk as temporary storage (not recommended!)
If you're new to XProc and/or have implemented pipeline-like processing using Ant or command/shell scripts, you might be tempted to write intermediate results to disk. Yes, you can do that in XProc, but it's definitely not recommended. Why would you serialize a document to disk and then re-parse it into memory if you already have it in memory? That's a gross waste of CPU cycles. It also requires a rather complicated pipeline. Nonetheless, let's implement it this way to see how it can be done, just for the fun of it.
I actually encountered a situation where I needed this: A third-party stylesheet I had to use (and couldn’t change) expected some of its input documents on disk. So I had no other choice than to write all inputs, which were computed elsewhere in my XProc pipeline, to disk first.
Here is an XProc 3.0 pipeline that writes the combined temperatures document to a temporary file. It uses the template file as main input for the merging stylesheet and passes the names of the other input files as parameters:
The three necessary documents are passed in through input ports:
Since it's not clear which input is primary, I chose to declare them all as non-primary. But there's nothing wrong with favouring one of them over the others and making it primary (primary="true").
With more than one input port, primary="false" is the default, so you could omit the primary attribute.
I was deliberately verbose in declaring the ports, specifying everything I could: sequence="false" content-types="xml". This is not strictly necessary but it will make your pipeline more robust: if you ever accidentally pass in multiple documents or a non-XML document, the step will fail. I consider it good practice.
sequence="false" is the default, so you could omit the sequence attribute.
To simplify running the pipeline, default document connections were added to the ports, using the href attribute. If you specify some other connection for one of these input ports (on the command line for instance), this default will be ignored.
The output port is declared as one that, from inside the pipeline, receives XML (content-types="xml"), but advises the processor to use HTML when it serializes the document (writes it to disk) (serialization="map{'method': 'html', …}"). Fortunately for us, the processor usually follows advice…
The p:file-create-tempfile step creates a temporary file and outputs its absolute URI in a single c:result element. The delete-on-exit="true" attribute tells the processor to try to delete the temporary file after all processing is finished.
After the temporary file is created, we put its URI in the temps-filename variable.
The p:xinclude step resolves the XIncludes in the temperatures file. After that, we write it to the temporary file we created previously.
Then we validate the city identifiers document. If this fails, an error will be thrown and we're done.
Since we want our stylesheet to read all additional documents from disk, we need the URI of the city identifiers document. Here we store it in the city-ids-filename variable.
Having doubts about this construction? You should! This will fail miserably if we ever use this pipeline as a step in another pipeline. In that case there's no guarantee that the city identifiers document, coming in through the city-ids port, is on disk at all. Maybe it was created or computed… For this simple case it works, but don't rely on these kinds of constructions in more complex projects.
Finally we transform it using XSLT. The main input to this transformation is our HTML template document, coming in through the template port. We pass the filenames for the other two documents using the p:xslt step's parameters option.
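Put together, the steps described above could be sketched as follows. This is an approximation under several assumptions, not a definitive listing: the port name temperatures, all href values and the stylesheet name are hypothetical, validation is assumed to use p:validate-with-xml-schema (the schema is an XSD), and the depends attribute is assumed to be the way to force storing and validating to finish before the transformation runs. The file steps are an optional XProc 3.0 library, so check that your processor supports them.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0" name="main-pipeline">

  <!-- Port name "temperatures" and all href values are assumptions -->
  <p:input port="template" primary="false" sequence="false" content-types="xml" href="data/template.xml"/>
  <p:input port="temperatures" primary="false" sequence="false" content-types="xml" href="data/temperatures.xml"/>
  <p:input port="city-ids" primary="false" sequence="false" content-types="xml" href="data/city-ids.xml"/>
  <p:output port="result" primary="true" sequence="false" content-types="xml"
            serialization="map{'method': 'html'}"/>

  <!-- Create the temporary file; its absolute URI arrives in a c:result element -->
  <p:file-create-tempfile delete-on-exit="true"/>
  <p:variable name="temps-filename" as="xs:string" select="string(/c:result)"/>

  <!-- Resolve the XIncludes and write the combined document to the temporary file -->
  <p:xinclude>
    <p:with-input pipe="temperatures@main-pipeline"/>
  </p:xinclude>
  <p:store name="store-temperatures" href="{$temps-filename}"/>

  <!-- Validate the city identifiers; a validation failure stops the pipeline -->
  <p:validate-with-xml-schema name="validate-city-ids-document">
    <p:with-input port="source" pipe="city-ids@main-pipeline"/>
    <p:with-input port="schema" href="data/xsd/city-ids.xsd"/>
  </p:validate-with-xml-schema>

  <!-- Fragile: only works when the city-ids document was actually loaded from disk -->
  <p:variable name="city-ids-filename" as="xs:string" select="string(base-uri(/))"
              pipe="city-ids@main-pipeline"/>

  <!-- depends: run only after the temperatures are stored and the city ids validated -->
  <p:xslt depends="store-temperatures validate-city-ids-document"
          parameters="map{'temps-filename': $temps-filename, 'city-ids-filename': $city-ids-filename}">
    <p:with-input port="source" pipe="template@main-pipeline"/>
    <p:with-input port="stylesheet" href="xsl/merge-from-disk.xsl"/>
  </p:xslt>

</p:declare-step>
```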
It depends...
A stylesheet that creates the final result looks like this (not intended as a particularly illustrative example of an XSLT stylesheet, just short. xsl:for-each haters, please hold your fire ;) ):
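A hedged sketch of such a stylesheet, reading both additional documents from the URIs passed in as parameters. The parameter names match the variables described above. The data layout is an assumption made for illustration: city-temperatures/city-temperature[@idref]/temperature for the temperatures, city-ids/city[@id][@name] for the identifiers, and an HTML template in no namespace.

```xml
<xsl:stylesheet version="3.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs">

  <!-- Copy the template document unless a template rule says otherwise -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- URIs of the two additional documents, passed in by the pipeline -->
  <xsl:param name="temps-filename" as="xs:string" required="yes"/>
  <xsl:param name="city-ids-filename" as="xs:string" required="yes"/>

  <!-- Element names below are hypothetical, except city-temperatures -->
  <xsl:variable name="temperatures" select="doc($temps-filename)/city-temperatures"/>
  <xsl:variable name="city-ids" select="doc($city-ids-filename)/city-ids"/>

  <xsl:template match="body">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- One paragraph per city: its name followed by its temperatures -->
      <xsl:for-each select="$temperatures/city-temperature">
        <xsl:variable name="id" select="@idref"/>
        <p>
          <xsl:value-of select="$city-ids/city[@id = $id]/@name"/>
          <xsl:text>: </xsl:text>
          <xsl:value-of select="temperature" separator=", "/>
        </p>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```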
Wrapping all documents into one
Using a temporary file to store a constructed intermediate XML document, as we did in the section called “Using disk as temporary storage (not recommended!)”, is not only a waste of computing resources, it also results in an unnecessarily complicated pipeline (Example 5). So how can we do better? A much simpler and pretty common strategy is to take all documents, wrap them up in an encompassing document and feed this to the XSLT stylesheet.
The wrapped document we're going to create will look like this:
And here is a pipeline that creates and uses such a wrapped document:
(For an explanation of the input and output port declarations in the prolog, see the explanation of Example 5.)
We validate the city identifiers document. If this fails, an error is thrown and we're done.
When nothing’s wrong, the p:validate step acts like a p:identity step and outputs its input, unchanged (that is to say, it might add the Post-Schema-Validation-Infoset (PSVI) annotations but that’s something we’re not using here).
The p:xinclude step resolves the XIncludes in the temperatures file.
We create the wrapped document by feeding the p:wrap-sequence step a sequence of the three documents, using three whitespace-separated entries in the p:with-input's pipe attribute. Each entry is a connection to a port:
The template document is inserted directly from the main pipeline’s template port.
The temperatures document is inserted from the output (result port) of the p:xinclude step.
The city identifiers document is inserted from the output (result port) of the p:validate step.
Notice that we explicitly read it from the output of p:validate, not from the main pipeline’s city-ids port, even though both documents have the same contents. Doing it like this creates an explicit dependency and ensures that p:validate is run before p:xslt.
Now XSLT can do its magic on the wrapped documents. The wrapped document feeds automatically into p:xslt's primary source port because of the implicit connection between the p:wrap-sequence and p:xslt steps.
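The wrapping steps described above could be sketched like this. The wrapper element name wrapped-documents, the port name temperatures, the step name resolve-temperatures, all href values and the stylesheet name are assumptions, and validation is assumed to use p:validate-with-xml-schema because the schema is an XSD:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0" name="main-pipeline">

  <!-- Port name "temperatures" and all href values are assumptions -->
  <p:input port="template" primary="false" sequence="false" content-types="xml" href="data/template.xml"/>
  <p:input port="temperatures" primary="false" sequence="false" content-types="xml" href="data/temperatures.xml"/>
  <p:input port="city-ids" primary="false" sequence="false" content-types="xml" href="data/city-ids.xml"/>
  <p:output port="result" primary="true" sequence="false" content-types="xml"
            serialization="map{'method': 'html'}"/>

  <!-- Validate the city identifiers; on success the document passes through unchanged -->
  <p:validate-with-xml-schema name="validate-city-ids-document">
    <p:with-input port="source" pipe="city-ids@main-pipeline"/>
    <p:with-input port="schema" href="data/xsd/city-ids.xsd"/>
  </p:validate-with-xml-schema>

  <!-- Resolve the XIncludes in the temperatures master file -->
  <p:xinclude name="resolve-temperatures">
    <p:with-input pipe="temperatures@main-pipeline"/>
  </p:xinclude>

  <!-- Wrap the three documents, in this order, in a single root element -->
  <p:wrap-sequence wrapper="wrapped-documents">
    <p:with-input pipe="template@main-pipeline
                        result@resolve-temperatures
                        result@validate-city-ids-document"/>
  </p:wrap-sequence>

  <!-- The wrapped document arrives on the source port by implicit connection -->
  <p:xslt>
    <p:with-input port="stylesheet" href="xsl/merge-wrapped.xsl"/>
  </p:xslt>

</p:declare-step>
```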
Here's a simple XSLT stylesheet that uses these wrapped documents:
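A hedged sketch of a stylesheet for the wrapped input. It assumes a wrapper element called wrapped-documents that contains the HTML template (root html, no namespace), the temperatures (city-temperatures/city-temperature[@idref]/temperature) and the identifiers (city-ids/city[@id][@name]). All of these names except city-temperatures are hypothetical.

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy the template document unless a template rule says otherwise -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Drop the wrapper and continue with the HTML template only -->
  <xsl:template match="/wrapped-documents">
    <xsl:apply-templates select="html"/>
  </xsl:template>

  <xsl:template match="body">
    <!-- Both additional documents are siblings of the template inside the wrapper -->
    <xsl:variable name="city-ids" select="/wrapped-documents/city-ids"/>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:for-each select="/wrapped-documents/city-temperatures/city-temperature">
        <xsl:variable name="id" select="@idref"/>
        <p>
          <xsl:value-of select="$city-ids/city[@id = $id]/@name"/>
          <xsl:text>: </xsl:text>
          <xsl:value-of select="temperature" separator=", "/>
        </p>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the input is a single ordinary XML document, a stylesheet like this can be developed and tested outside XProc by hand-writing a wrapped-documents file.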
Passing documents as parameters
Yet another way to do the same is by passing the additional documents to the XSLT stylesheet as document parameters. That is, as parameters of type document-node(). You might, like I was, be surprised that a thing like this is even possible: limiting parameters in stylesheets to strings was always the safe and easy way. But with the arrival of XPath 3.1 you can now also use other data types for this.
A pipeline that exploits this feature looks like this:
(For an explanation of the input and output port declarations in the prolog, see the explanation of Example 5.)
Like in Example 8, validate the city identifiers document.
The p:xinclude step resolves the XIncludes in the temperatures file.
To be able to pass the additional documents, we create two variables of type document-node():
The first one, city-temps-document, refers, by implicit connection, to the document that comes out of the preceding p:xinclude step.
The second one, city-ids-document, refers, by explicit connection using pipe="result@validate-city-ids-document", to the document that comes out of the p:validate step. Again, like in the section called “Wrapping all documents into one”, we use this and not the city-ids@main-pipeline port to make sure that p:validate runs before p:xslt.
The p:xslt step runs with the template document fed explicitly into its primary port (by <p:with-input pipe="template@main-pipeline"/>). The additional documents are passed in as parameters using the map in the parameters option.
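As a sketch, the steps described above might look like this. The variable names and the validate-city-ids-document step name come from the description; the port name temperatures, all href values and the stylesheet name are assumptions, and validation is assumed to use p:validate-with-xml-schema because the schema is an XSD:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0" name="main-pipeline">

  <!-- Port name "temperatures" and all href values are assumptions -->
  <p:input port="template" primary="false" sequence="false" content-types="xml" href="data/template.xml"/>
  <p:input port="temperatures" primary="false" sequence="false" content-types="xml" href="data/temperatures.xml"/>
  <p:input port="city-ids" primary="false" sequence="false" content-types="xml" href="data/city-ids.xml"/>
  <p:output port="result" primary="true" sequence="false" content-types="xml"
            serialization="map{'method': 'html'}"/>

  <!-- Validate the city identifiers; on success the document passes through unchanged -->
  <p:validate-with-xml-schema name="validate-city-ids-document">
    <p:with-input port="source" pipe="city-ids@main-pipeline"/>
    <p:with-input port="schema" href="data/xsd/city-ids.xsd"/>
  </p:validate-with-xml-schema>

  <!-- Resolve the XIncludes in the temperatures master file -->
  <p:xinclude>
    <p:with-input pipe="temperatures@main-pipeline"/>
  </p:xinclude>

  <!-- Implicit connection: the document coming out of p:xinclude -->
  <p:variable name="city-temps-document" as="document-node()" select="."/>
  <!-- Explicit connection: the validated city identifiers document -->
  <p:variable name="city-ids-document" as="document-node()" select="."
              pipe="result@validate-city-ids-document"/>

  <!-- The template is the main input; the other documents travel as parameters -->
  <p:xslt parameters="map{'city-temps-document': $city-temps-document,
                          'city-ids-document': $city-ids-document}">
    <p:with-input pipe="template@main-pipeline"/>
    <p:with-input port="stylesheet" href="xsl/merge-parameters.xsl"/>
  </p:xslt>

</p:declare-step>
```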
And a simple XSLT stylesheet that uses these parameters:
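A hedged sketch of a stylesheet that receives the additional documents as document-node() parameters. The parameter names follow the variables described above. The data layout is assumed for illustration: city-temperatures/city-temperature[@idref]/temperature, city-ids/city[@id][@name], and an HTML template in no namespace.

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy the template document unless a template rule says otherwise -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Complete documents, passed in by the pipeline rather than read from disk -->
  <xsl:param name="city-temps-document" as="document-node()" required="yes"/>
  <xsl:param name="city-ids-document" as="document-node()" required="yes"/>

  <xsl:template match="body">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Element and attribute names are hypothetical, except city-temperatures -->
      <xsl:for-each select="$city-temps-document/city-temperatures/city-temperature">
        <xsl:variable name="id" select="@idref"/>
        <p>
          <xsl:value-of select="$city-ids-document/city-ids/city[@id = $id]/@name"/>
          <xsl:text>: </xsl:text>
          <xsl:value-of select="temperature" separator=", "/>
        </p>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```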
Passing all documents as a collection
One more way to do the same is by passing all the documents as a collection. This requires some explanation upfront:
We’re going to tell p:xslt that it should start by invoking a named template (instead of applying templates) by setting the template-name option (to the name of the template we want to invoke).
Setting the template-name option has the intended additional effect of making all documents on the source port available as the default collection, accessible with the XPath collection() function.
We then pass all our documents (the HTML template, the city temperatures and the city identifiers) as a sequence (of documents) on p:xslt's source port.
With all this, in the XSLT stylesheet, the collection() function now returns the sequence of our three input documents (count(collection()) will be 3). The root element of, for instance, the city temperatures document can be accessed by collection()/city-temperatures.
A pipeline that uses this collection feature looks like this:
(For an explanation of the input and output port declarations in the prolog, see the explanation of Example 5.)
Like in Example 8, validate the city identifiers document.
The p:xinclude step resolves the XIncludes in the temperatures file.
Invoke the XSLT stylesheet with the template-name option set to process-documents. This will have the XSLT processor look for a named template called process-documents and start there.
We feed the primary source port a sequence of our three documents, using three whitespace-separated entries in the p:with-input's pipe attribute. Each entry is a connection to a port. This is similar to what we did in Example 8 for the input to p:wrap-sequence.
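A sketch of the steps just described. The template name process-documents comes from the description; the port name temperatures, the step name resolve-temperatures, all href values and the stylesheet name are assumptions, and validation is assumed to use p:validate-with-xml-schema because the schema is an XSD:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0" name="main-pipeline">

  <!-- Port name "temperatures" and all href values are assumptions -->
  <p:input port="template" primary="false" sequence="false" content-types="xml" href="data/template.xml"/>
  <p:input port="temperatures" primary="false" sequence="false" content-types="xml" href="data/temperatures.xml"/>
  <p:input port="city-ids" primary="false" sequence="false" content-types="xml" href="data/city-ids.xml"/>
  <p:output port="result" primary="true" sequence="false" content-types="xml"
            serialization="map{'method': 'html'}"/>

  <!-- Validate the city identifiers; on success the document passes through unchanged -->
  <p:validate-with-xml-schema name="validate-city-ids-document">
    <p:with-input port="source" pipe="city-ids@main-pipeline"/>
    <p:with-input port="schema" href="data/xsd/city-ids.xsd"/>
  </p:validate-with-xml-schema>

  <!-- Resolve the XIncludes in the temperatures master file -->
  <p:xinclude name="resolve-temperatures">
    <p:with-input pipe="temperatures@main-pipeline"/>
  </p:xinclude>

  <!-- Start at the named template; all three source documents form the default collection -->
  <p:xslt template-name="process-documents">
    <p:with-input port="source" pipe="template@main-pipeline
                                      result@resolve-temperatures
                                      result@validate-city-ids-document"/>
    <p:with-input port="stylesheet" href="xsl/merge-collection.xsl"/>
  </p:xslt>

</p:declare-step>
```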
And the XSLT stylesheet that uses the collection looks like this. Notice the process-documents named template. This is where the processing will start.
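A hedged sketch of such a stylesheet. The process-documents template name and the city-temperatures root element come from the description above. The rest of the data layout is assumed for illustration: an HTML template with root html in no namespace, city-temperature[@idref]/temperature entries, and city-ids/city[@id][@name].

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy the template document unless a template rule says otherwise -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Pick the documents out of the default collection by their root element -->
  <xsl:variable name="temperatures" select="collection()/city-temperatures"/>
  <xsl:variable name="city-ids" select="collection()/city-ids"/>

  <!-- Entry point, selected by the pipeline's template-name option -->
  <xsl:template name="process-documents">
    <xsl:apply-templates select="collection()/html"/>
  </xsl:template>

  <xsl:template match="body">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:for-each select="$temperatures/city-temperature">
        <xsl:variable name="id" select="@idref"/>
        <p>
          <xsl:value-of select="$city-ids/city[@id = $id]/@name"/>
          <xsl:text>: </xsl:text>
          <xsl:value-of select="temperature" separator=", "/>
        </p>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```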
Wrap up
As we've seen, there are several ways to pass documents to an XSLT stylesheet for merging:
By using the file system as intermediate storage (the section called “Using disk as temporary storage (not recommended!)”). Definitely not recommended.
By wrapping the documents in some container document and passing this to the stylesheet (the section called “Wrapping all documents into one”).
By passing the additional documents to the stylesheet as document type parameters (the section called “Passing documents as parameters”).
By passing the documents as a collection (the section called “Passing all documents as a collection”).
I don't really have a recommendation on which one to use (except not to use the first). Wrapping documents (number 2) is for me personally the "classic" way to do this, going back to my extensive XProc 1.0 pipelines. It has the slight advantage that the input to your XSLT stylesheet is easy to emulate (just create an appropriate XML document), allowing you to develop and test the stylesheet without having to run it in an XProc context. But the other approaches also have their merits. You can even mix the approaches if that suits your requirements. And maybe there are more ways to do this (if you find one, let me know).
So: experiment, pick, and choose. Happy XProc-ing!