XML.com

Release of Open and Save Archive, 2019

October 31, 2019

Submitted by Joel Kalvesmaki.

Open and Save Archive

Open and Save Archive (formerly xslt-for-docx; licensed under the GNU General Public License) is an XSLT library that allows users to extract and save components of docx (Microsoft Word), xlsx (Microsoft Excel), zip, epub, odt, ods, jar, rar, and other compressed archive formats.

What can be processed depends on the Saxon edition:

Saxon PE and EE: any kind of archive can be retrieved or saved.

Saxon HE: only docx and xlsx files can be opened and saved, and only without their binary components (images, videos, etc.).

The library is written in XSLT 3.0, and relies upon extension functions defined by EXPath.
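To make concrete what "components of an archive" means: docx, xlsx, epub, odt, ods, and jar files are zip containers holding several named parts. The following sketch is written in Python, not XSLT, and does not use or reproduce the library's functions; the names unpack and repack and the two-part sample archive are hypothetical, chosen only for illustration. It handles zip-based formats only (not rar).

```python
import io
import zipfile


def unpack(archive_bytes: bytes) -> dict:
    """Return a mapping of component path -> raw bytes for a zip-based archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def repack(components: dict) -> bytes:
    """Write the components into a new zip archive and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in components.items():
            zf.writestr(name, data)
    return buffer.getvalue()


# Hypothetical two-part archive standing in for a real docx file.
sample = repack({
    "[Content_Types].xml": b"<Types/>",
    "word/document.xml": b"<document/>",
})

parts = unpack(sample)             # fetch the component parts
roundtrip = unpack(repack(parts))  # repackage, then open again
```

After the round trip, the component names and contents are unchanged, which is the property any open-and-save workflow needs to preserve for the parts it does not modify.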

Practical applications are featured in example subdirectories:

  1. Unpacking and saving archives: basic demonstration of how to fetch the component parts of an archive, then repackage and save them.
  2. Plain text: shows how to scrape multiple docx or xlsx files for their plain text content and concatenate it in a single file.
  3. Replacing text via regular expressions: shows how to do a search and replace on a Word or Excel file using regular expressions. This example is important because regular expressions are non-existent in Excel, and quite deficient in Word. Finding and replacing text in Word is tricky, and in this example I use and explain what I call the Splintered Seas Technique (with apologies to anyone who might have invented, used, and named a similar technique before me).
  4. Make form letters: shows how to turn a template Word document and an XML database into form letters. This example is important because Word cannot easily handle data that does not fit the spreadsheet model, and does not have good tools for coordinating and manipulating data. In this example, you use XSLT to define variables of your choice, then you type those variables wherever you like within the docx template, e.g., $family-name. You can iterate over multiple values, and apply XSLT functions to change the data and its formatting as you like, all of which is difficult or impossible to do in Word.
  5. Anonymize documents: shows how to quickly scrub from the metadata the names of those who are credited with writing a document or its comments or tracked changes. This is useful when returning to an author a manuscript that has been through blind peer review, and you wish to preserve the anonymity of the writers. To my knowledge this functionality is missing from Word.
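As a rough illustration of examples 2 and 3, the sketch below (in Python, not the library's XSLT) pulls plain text out of the main document part of a docx file and shows one reason search and replace is tricky: Word can split a single word across several text runs, so a pattern may match the paragraph text without matching any individual run. The sample XML and the function names are hypothetical, the in-memory file contains only the one part needed here, and the Splintered Seas Technique itself is not reproduced.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical document part: the word "archive" is split across two runs.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>Open the arch</w:t></w:r><w:r><w:t>ive.</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Then save it.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def make_docx(document_xml: str) -> bytes:
    """Build a stand-in docx in memory (a real file has more parts)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def read_body(docx_bytes: bytes) -> ET.Element:
    """Parse word/document.xml out of the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return ET.fromstring(zf.read("word/document.xml"))


def run_texts(root: ET.Element) -> list:
    """Text of each w:t element, one string per run fragment."""
    return [t.text or "" for t in root.iter(f"{{{W}}}t")]


def paragraph_texts(root: ET.Element) -> list:
    """Text of each paragraph, with its run fragments joined."""
    return ["".join(run_texts(p)) for p in root.iter(f"{{{W}}}p")]


root = read_body(make_docx(DOCUMENT_XML))
plain_text = "\n".join(paragraph_texts(root))

pattern = re.compile(r"archive")
hits_in_runs = [r for r in run_texts(root) if pattern.search(r)]
hits_in_paragraphs = [p for p in paragraph_texts(root) if pattern.search(p)]
```

Here the pattern finds nothing when the runs are searched one at a time, but it does match the first paragraph once the runs are joined. Writing a replacement back therefore means deciding how to redistribute the changed text among the original runs and their formatting, which this sketch does not attempt.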

Word and Excel files can get quite complex. The XSLT files in the examples have been written specifically for the accompanying sample input. You may need to adapt the code to handle the input you are working with.

For EXPath users: The function tan:extract-map() is my attempt to instantiate, enhance, and develop the EXPath function arch:extract-map(). See the stylesheet for more comments.

Numerous other features and caveats are explained in the notes and the main stylesheet.
