The use of OASIS genericode for arbitrary sparsely-populated tables in XML

November 5, 2017

OASIS genericode

OASIS Genericode is a standard developed for interchanging code lists in XML. Here, Ken Holman discusses an unanticipated but useful application of Genericode to tables that are not entirely full of content and not in any way related to code lists.

Abstract

This is a technical essay examining the use of OASIS genericode to serialize tabular content in XML syntax, and the use of XSLT to access such content. In essence, the names of the constructs in genericode leave the impression the format is suitable only to serialize code lists. But a common aspect of a table of codes with their applicable metadata is that for many codes only some or none of the metadata may apply. As genericode handles this well, it proves suitable for serializing any sparsely-populated table, not just tables of codes. This essay illustrates the use of genericode to represent arbitrary sparse tables and the use of XSLT to access them.

Recently I was asked to summarize how OASIS genericode 1.0 is fit for purpose in expressing arbitrary sparsely-populated tabular content. The question was framed to me as “why are you using genericode to express the properties of a document model rather than create a new or other XML vocabulary to do so?” The document models where genericode is being used are those of business documents used in OASIS projects.

The genesis of genericode was for the XML expression of a set of coded values from a coded value domain and their associated metadata (both at the list level as a whole and at the coded value level individually). Other uses of the new XML vocabulary were not discussed at the time, as the focus was only on code lists.

A densely-populated table, that is, one where most of the columns of all rows have content, can efficiently be expressed in a tight row/column XML vocabulary such as the OASIS Exchange Table Model (derived from CALS) or the XHTML table model. Even tighter representations are found in non-XML formats such as comma-separated values (CSV) or tab-separated values (TSV). But when genericode was being considered, these were discounted in favour of an associative representation that is not nearly as compact but is much more easily read by human eyes and more declarative from a programming perspective. Moreover, the associative representation can be somewhat self-documenting regarding semantics when the labels are chosen appropriately.

Expressing a table in a dense representation requires readers and programmers to track the ordinal column number while traversing the row (with which, then, to infer the associated semantics), as well as order the columns identically in each row. This may not always be convenient when dealing with a sparsely-populated table. Certainly it can be programmed, but the expression of the table does not manifest the column associations with the column values. Thus, a new XML vocabulary for code lists was created with these manifest associations. But given that code-level metadata can be different for every code list, rather than create an XML vocabulary with explicit semantic elements genericode uses a generic column definition element and a generic column reference element allowing an arbitrary column reference. Thus, users are free to create their own semantic labels as associations. Such manifest expression is helpful to the human reader of the table, useful to the programmer, and can be self-documenting in programming code. It also opens the XML format to be useful for arbitrary semantics that have nothing to do with code lists.

Consider also a sparsely-populated table where each row may have many available column values to be defined but only a few columns actually specified in any given row. The lack of a manifest association between the values and the column identifications can make the table difficult for human eyes to interpret and a challenge to create (where, for example, an errant comma in a CSV file throws everything on the row out of order). Because genericode provides such a manifest association, it is eminently suitable for serializing sparse tables.
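The errant-comma problem can be sketched in a few lines. This is only an illustration, written in Python rather than XSLT so that it can be run on its own; the column order in COLUMNS and the helper read_positional are hypothetical and are not part of genericode or of any tool mentioned in this essay.

```python
import csv
import io

# Hypothetical column order for a positional (CSV) table.
COLUMNS = ["code", "name", "description", "levelcategory", "symbol"]


def read_positional(line):
    """Pair each CSV field with a column purely by its ordinal position."""
    fields = next(csv.reader(io.StringIO(line)))
    # Empty fields stand for columns that have no value in this row.
    return {col: val for col, val in zip(COLUMNS, fields) if val}


# Intended row: code, name, no description, no level/category, symbol.
good = read_positional("M25,percent per degree Celsius,,,%/°C")

# An errant comma after the code shifts every later value one column over:
# the name is read as the description and the symbol falls off the end.
bad = read_positional("M25,,percent per degree Celsius,,,%/°C")

# The associative form names the column of every value, so omitting a
# column cannot change the meaning of the values that are present.
associative = {
    "code": "M25",
    "name": "percent per degree Celsius",
    "symbol": "%/°C",
}
```

In the positional reading nothing in the damaged line signals the error; in the associative form there is no position to get wrong.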

OASIS genericode 1.0 support files are found here:

http://docs.oasis-open.org/codelist/cs-genericode-1.0/ - specification and XML schema
https://cranesoftwrights.github.io/resources/ubl/index.htm#oxygengc - oXygenXML editing framework

The two major components of a genericode file are the identification elements and the table elements.

The identification elements provide a home for distinguishing metadata one would use to associate the entire table with some external semantic definition. For example, in the coded value domain of currency identification, such metadata would identify the table as being, say, the ISO-maintained list of currency codes.

The table elements have two top-level items: the column specifications and the rows of values. The column specifications distinguish each of the available column values by identifiers and can include supplementary information (such as the column value’s data type). Also, it is possible to identify which columns or combinations of columns are intended to be keys into the rows. Accordingly, it is expected that the reified key column values in each of the rows are unique across the entire set of rows. This enables an application accessing the rows to find a given row by its key value/combination.

Although created for expressing lists of codes and associated metadata, in fact any connotation of “code” in element names and semantics is found only in the <CodeList> document element and the <SimpleCodeList> table wrapper. Otherwise, there are elements for table identification metadata and elements for table content expression, but the semantics of those are entirely table-oriented and not code- or code-list-oriented. And, as mentioned before, the semantics of the associations are entirely user-defined.

Certainly most code lists have few columns and many rows and would be considered densely-populated. However, there are code lists whose value-level metadata is sparsely-populated. Such associated values might be translations of values or definitions associated with the codes. For example, the Unit of Measure code list in the UBL 2.1 distribution has 1 row with 2 metadata values, 444 rows with 3 metadata values, 1235 rows with 4 metadata values, 395 rows with 5 metadata values, and 15 rows with 6 metadata values:

http://docs.oasis-open.org/ubl/os-UBL-2.1/cl/gc/default/UnitOfMeasureCode-2.1.gc

An example row reads as follows:

<Row>
   <Value ColumnRef="code">
      <SimpleValue>M25</SimpleValue>
   </Value>
   <Value ColumnRef="name">
      <SimpleValue>percent per degree Celsius</SimpleValue>
   </Value>
   <Value ColumnRef="description">
      <SimpleValue>A unit of proportion, equal to 0.01, in relation to a temperature of one degree.</SimpleValue>
   </Value>
   <Value ColumnRef="levelcategory">
      <SimpleValue>3.7</SimpleValue>
   </Value>
   <Value ColumnRef="symbol">
      <SimpleValue>%/°C</SimpleValue>
   </Value>
   <Value ColumnRef="conversionfactor">
      <SimpleValue>10² °C¹</SimpleValue>
   </Value>
</Row>

The column references above point to column definitions with properties about each column, in this example defined as follows:

<ColumnSet>
  <Column Id="status" Use="optional">
     <ShortName>Status</ShortName>
     <LongName xml:lang="en">Status</LongName>
     <Data Type="normalizedString"/>
  </Column>
  <Column Id="code" Use="required">
     <ShortName>Code</ShortName>
     <LongName xml:lang="en">Common Code</LongName>
     <Data Type="normalizedString"/>
  </Column>
  <Column Id="name" Use="required">
     <ShortName>Name</ShortName>
     <LongName xml:lang="en">Name</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="description" Use="optional">
     <ShortName>Description</ShortName>
     <LongName xml:lang="en">Description</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="levelcategory" Use="optional">
     <ShortName>LevelCategory</ShortName>
     <LongName xml:lang="en">Level / Category</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="symbol" Use="optional">
     <ShortName>Symbol</ShortName>
     <LongName xml:lang="en">Symbol</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="conversionfactor" Use="optional">
     <ShortName>ConversionFactor</ShortName>
     <LongName xml:lang="en">Conversion Factor</LongName>
     <Data Type="string"/>
  </Column>
  <Key Id="codeKey">
     <ShortName>CodeKey</ShortName>
     <ColumnRef Ref="code"/>
  </Key>
</ColumnSet>

At the end of the column set the “code” column is declared to be a key column, that is, no two rows of the table will have the same value in that particular column.
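As a rough sketch of what an application might do with these declarations, the following Python (standing in for XSLT so that it is runnable by itself) reads the column set, then reports rows that lack a required column or that repeat a key value. The document in DOC is an abbreviated, hypothetical fragment modelled on the example above and is not schema-valid genericode (the identification elements are omitted); the function name check is likewise only illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical fragment; the second row is deliberately faulty.
DOC = """
<gc:CodeList xmlns:gc="http://docs.oasis-open.org/codelist/ns/genericode/1.0/">
  <ColumnSet>
    <Column Id="code" Use="required"><ShortName>Code</ShortName></Column>
    <Column Id="name" Use="required"><ShortName>Name</ShortName></Column>
    <Column Id="symbol" Use="optional"><ShortName>Symbol</ShortName></Column>
    <Key Id="codeKey"><ShortName>CodeKey</ShortName><ColumnRef Ref="code"/></Key>
  </ColumnSet>
  <SimpleCodeList>
    <Row>
      <Value ColumnRef="code"><SimpleValue>M25</SimpleValue></Value>
      <Value ColumnRef="name"><SimpleValue>percent per degree Celsius</SimpleValue></Value>
      <Value ColumnRef="symbol"><SimpleValue>%/°C</SimpleValue></Value>
    </Row>
    <Row>
      <Value ColumnRef="code"><SimpleValue>M25</SimpleValue></Value>
    </Row>
  </SimpleCodeList>
</gc:CodeList>
"""


def check(root):
    """Return a list of problems: missing required columns, duplicate keys."""
    required = [c.get("Id") for c in root.iter("Column")
                if c.get("Use") == "required"]
    key_cols = [r.get("Ref") for r in root.find("ColumnSet/Key").iter("ColumnRef")]
    problems, seen = [], set()
    for number, row in enumerate(root.iter("Row"), start=1):
        # Each row becomes a mapping from column reference to value.
        values = {v.get("ColumnRef"): v.findtext("SimpleValue")
                  for v in row.iter("Value")}
        for col in required:
            if col not in values:
                problems.append(f"row {number}: missing required column '{col}'")
        key = tuple(values.get(col) for col in key_cols)
        if key in seen:
            problems.append(f"row {number}: duplicate key {key}")
        seen.add(key)
    return problems


problems = check(ET.fromstring(DOC))
for problem in problems:
    print(problem)
```

Because the key is gathered as a tuple of the referenced columns, the same check would cover a key declared over a combination of columns.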

But consider using associations for other sets of semantics not at all related to code lists. Another example of a sparsely-populated table is found in UBL 2.1, that being the properties of the business information entities (BIEs) of the UBL document models. During UBL development it was determined that the model information needed to be in XML to support tooling the distribution artefacts. The manifest nature of genericode informed the decision to express the models in this XML syntax even though none of the BIE semantics have anything to do with lists of codes. That (very large!) resulting genericode file is found at:

http://docs.oasis-open.org/ubl/os-UBL-2.1/mod/UBL-Entities-2.1.gc

The OASIS Business Document Naming and Design Rules Version 1.0 (version 1.1 is in development) guides the UBL Technical Committee on the use of UN/CEFACT Core Component Technical Specification (CCTS) v2.01 to express the document models and in this document one finds the semantics of the value associations in the table:

http://docs.oasis-open.org/ubl/Business-Document-NDR/v1.0/Business-Document-NDR-v1.0.html

The resulting models are sparsely-populated tables where one finds a total of 245 rows with 14 values, 46 rows with 15 values, 2 rows with 16 values, 586 rows with 18 values, 1263 rows with 19 values, 1088 rows with 20 values, 648 rows with 21 values, 204 rows with 22 values, 29 rows with 23 values, and 1 row with 24 values.
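A tally of this kind is simple to produce because every value in a row is its own element. The sketch below, in Python rather than XSLT so that it runs without an XSLT processor, counts the <Value> children of each <Row>; the three-row SAMPLE and the function name values_per_row are hypothetical, and this is not necessarily how the figures quoted above were obtained.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def values_per_row(root):
    """Map 'number of values in a row' to 'number of rows with that many'."""
    return Counter(len(row.findall("Value")) for row in root.iter("Row"))


# Tiny hypothetical sample; a real genericode file would instead be
# loaded with ET.parse(path).getroot().
SAMPLE = """
<SimpleCodeList>
  <Row><Value ColumnRef="a"><SimpleValue>1</SimpleValue></Value></Row>
  <Row>
    <Value ColumnRef="a"><SimpleValue>2</SimpleValue></Value>
    <Value ColumnRef="c"><SimpleValue>x</SimpleValue></Value>
  </Row>
  <Row><Value ColumnRef="b"><SimpleValue>3</SimpleValue></Value></Row>
</SimpleCodeList>
"""

histogram = values_per_row(ET.fromstring(SAMPLE))
for count in sorted(histogram):
    print(f"{histogram[count]} rows with {count} values")
```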

An example row is as follows where one finds no reference to code list semantics:

<Row><!--2332-->
   <Value ColumnRef="ModelName">
      <SimpleValue>UBL-CommonLibrary-2.1</SimpleValue>
   </Value>
   <Value ColumnRef="UBLName">
      <SimpleValue>PriceAmount</SimpleValue>
   </Value>
   <Value ColumnRef="DictionaryEntryName">
      <SimpleValue>Unstructured Price. Price Amount. Amount</SimpleValue>
   </Value>
   <Value ColumnRef="ObjectClass">
      <SimpleValue>Unstructured Price</SimpleValue>
   </Value>
   <Value ColumnRef="PropertyTermPossessiveNoun">
      <SimpleValue>Price</SimpleValue>
   </Value>
   <Value ColumnRef="PropertyTermPrimaryNoun">
      <SimpleValue>Amount</SimpleValue>
   </Value>
   <Value ColumnRef="PropertyTerm">
      <SimpleValue>Price Amount</SimpleValue>
   </Value>
   <Value ColumnRef="RepresentationTerm">
      <SimpleValue>Amount</SimpleValue>
   </Value>
   <Value ColumnRef="DataType">
      <SimpleValue>Amount. Type</SimpleValue>
   </Value>
   <Value ColumnRef="Cardinality">
      <SimpleValue>0..1</SimpleValue>
   </Value>
   <Value ColumnRef="ComponentType">
      <SimpleValue>BBIE</SimpleValue>
   </Value>
   <Value ColumnRef="Definition">
      <SimpleValue>The price amount.</SimpleValue>
   </Value>
   <Value ColumnRef="Examples">
      <SimpleValue>23.45</SimpleValue>
   </Value>
   <Value ColumnRef="CurrentVersion">
      <SimpleValue>2.1</SimpleValue>
   </Value>
   <Value ColumnRef="ContextBusinessProcess">
      <SimpleValue>In All Contexts</SimpleValue>
   </Value>
   <Value ColumnRef="ContextRegionGeopolitical">
      <SimpleValue>In All Contexts</SimpleValue>
   </Value>
   <Value ColumnRef="ContextOfficialConstraints">
      <SimpleValue>None</SimpleValue>
   </Value>
   <Value ColumnRef="ContextProduct">
      <SimpleValue>In All Contexts</SimpleValue>
   </Value>
   <Value ColumnRef="ContextIndustry">
      <SimpleValue>In All Contexts</SimpleValue>
   </Value>
   <Value ColumnRef="ContextRole">
      <SimpleValue>In All Contexts</SimpleValue>
   </Value>
   <Value ColumnRef="ContextSupportingRole">
      <SimpleValue>In All Contexts</SimpleValue>
   </Value>
   <Value ColumnRef="ContextSystemConstraint">
      <SimpleValue>In All Contexts</SimpleValue>
   </Value>
</Row>

In this subset of the CCTS set of columns, one sees at the very end the key column to be the dictionary entry name:

<ColumnSet>
  <Column Id="ModelName" Use="required">
    <ShortName>ModelName</ShortName>
    <LongName>Model Name</LongName>
    <Data Type="string"/>
  </Column>
  <Column Id="UBLName" Use="optional">
     <ShortName>UBLName</ShortName>
     <LongName>UBL Name</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="DictionaryEntryName" Use="required">
     <ShortName>DictionaryEntryName</ShortName>
     <LongName>Dictionary Entry Name</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="ObjectClassQualifier" Use="optional">
     <ShortName>ObjectClassQualifier</ShortName>
     <LongName>Object Class Qualifier</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="ObjectClass" Use="optional">
     <ShortName>ObjectClass</ShortName>
     <LongName>Object Class</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="PropertyTermQualifier" Use="optional">
     <ShortName>PropertyTermQualifier</ShortName>
     <LongName>Property Term Qualifier</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="PropertyTermPossessiveNoun" Use="optional">
     <ShortName>PropertyTermPossessiveNoun</ShortName>
     <LongName>Property Term Possessive Noun</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="PropertyTermPrimaryNoun" Use="optional">
     <ShortName>PropertyTermPrimaryNoun</ShortName>
     <LongName>Property Term Primary Noun</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="PropertyTerm" Use="optional">
     <ShortName>PropertyTerm</ShortName>
     <LongName>Property Term</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="RepresentationTerm" Use="optional">
     <ShortName>RepresentationTerm</ShortName>
     <LongName>Representation Term</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="DataTypeQualifier" Use="optional">
     <ShortName>DataTypeQualifier</ShortName>
     <LongName>Data Type Qualifier</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="DataType" Use="optional">
     <ShortName>DataType</ShortName>
     <LongName>Data Type</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="AssociatedObjectClassQualifier" Use="optional">
     <ShortName>AssociatedObjectClassQualifier</ShortName>
     <LongName>Associated Object Class Qualifier</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="AssociatedObjectClass" Use="optional">
     <ShortName>AssociatedObjectClass</ShortName>
     <LongName>Associated Object Class</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="AlternativeBusinessTerms" Use="optional">
     <ShortName>AlternativeBusinessTerms</ShortName>
     <LongName>Alternative Business Terms</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="Cardinality" Use="optional">
     <ShortName>Cardinality</ShortName>
     <LongName>Cardinality</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="ComponentType" Use="optional">
     <ShortName>ComponentType</ShortName>
     <LongName>Component Type</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="Definition" Use="optional">
     <ShortName>Definition</ShortName>
     <LongName>Definition</LongName>
     <Data Type="string"/>
  </Column>
  <Column Id="Examples" Use="optional">
     <ShortName>Examples</ShortName>
     <LongName>Examples</LongName>
     <Data Type="string"/>
  </Column>
...{documentary column definitions elided from the essay}...
  <Key Id="key">
     <ShortName>Key</ShortName>
     <ColumnRef Ref="DictionaryEntryName"/>
  </Key>
</ColumnSet>

Accessing the information in XSLT is straightforward, first through the use of the “key” facility in order to find the rows that are needed, and then a function to get at the column values:

<xs:key>
  <para>
    Index the genericode file for all BIEs by DEN.
  </para>
</xs:key>
<xsl:key name="gu:bie-by-den" match="Row" 
         use="gu:col(.,'DictionaryEntryName')"/>

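Rows can be indexed by any combination of columns to be useful in program logic, such as this combination of class, name and type (the $gu:names variable is not shown in this essay; given the function that follows, it presumably holds the alternative column references for the name):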

<xs:key>
  <para>Keeping track of entities by their class and type.</para>
</xs:key>
<xsl:key name="gu:bie-by-class-name-type" match="Row" 
         use="concat( gu:col(.,'ObjectClass'),' ',
                      gu:col(.,$gu:names),' ',
                      gu:col(.,'ComponentType'))"/>

I use the following function to return a given column from a row based on the column reference:

<xs:function>
  <para>Return a row's column value based on a column reference</para>
  <xs:param name="row">
    <para>The row of the genericode file.</para>
  </xs:param>
  <xs:param name="col">
    <para>
      The column reference of the value in the row.  Note that multiple
      column references are allowed, but only one of the column references
      is allowed to match.  If the row matches more than one column name
      given, this will abend in a runtime error.
    </para>
  </xs:param>
</xs:function>
<xsl:function name="gu:col" as="element(SimpleValue)?">
  <xsl:param name="row" as="element(Row)"/>
  <xsl:param name="col" as="xsd:string*"/>
  <xsl:variable name="gu:return" as="element(SimpleValue)*"
                select="$row/Value[@ColumnRef=$col]/SimpleValue"/>
  <xsl:if test="count($gu:return) > 1">
    <xsl:message terminate="yes">
      <xsl:text>Data error: multiple genericode values in a single </xsl:text>
      <xsl:text>row for column reference</xsl:text>
      <xsl:if test="count($col)>1">s</xsl:if>: <xsl:text/>
      <xsl:value-of select="$col" separator=", "/> at <xsl:text/>
      <xsl:for-each select="$row/ancestor-or-self::*">
        <xsl:text/>/<xsl:value-of select="name(.)"/>
        <xsl:if test="self::Row">[<xsl:number/>]</xsl:if>
      </xsl:for-each>
    </xsl:message>
  </xsl:if>
  <xsl:sequence select="$gu:return"/>
</xsl:function>
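For readers who do not work in XSLT, the same idea can be approximated in Python. This is a sketch of the technique only and not a translation of the stylesheet: the names col and index_rows are hypothetical, the function returns the text of the value rather than the <SimpleValue> element, and a plain dictionary keeps only one row per key, whereas an XSLT key can return several.

```python
import xml.etree.ElementTree as ET


def col(row, *refs):
    """Return the text of the one value in row whose ColumnRef is in refs.

    Several column references may be given but at most one may match;
    None is returned when the row has none of the columns.
    """
    matches = [v.findtext("SimpleValue") for v in row.findall("Value")
               if v.get("ColumnRef") in refs
               and v.find("SimpleValue") is not None]
    if len(matches) > 1:
        raise ValueError("multiple genericode values in a single row for "
                         f"column reference(s): {', '.join(refs)}")
    return matches[0] if matches else None


def index_rows(root, *refs):
    """Build a lookup from a column value to its row, loosely like xsl:key."""
    index = {}
    for row in root.iter("Row"):
        value = col(row, *refs)
        if value is not None:
            index[value] = row
    return index


# Abbreviated, hypothetical sample based on the row shown earlier.
ROWS = ET.fromstring("""
<SimpleCodeList>
  <Row>
    <Value ColumnRef="DictionaryEntryName">
      <SimpleValue>Unstructured Price. Price Amount. Amount</SimpleValue>
    </Value>
    <Value ColumnRef="UBLName"><SimpleValue>PriceAmount</SimpleValue></Value>
    <Value ColumnRef="ComponentType"><SimpleValue>BBIE</SimpleValue></Value>
  </Row>
</SimpleCodeList>
""")

by_den = index_rows(ROWS, "DictionaryEntryName")
row = by_den["Unstructured Price. Price Amount. Amount"]
print(col(row, "ComponentType"))
```

Returning None for an absent column is what makes sparse rows painless: the caller asks for a column by name and tests whether anything came back.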

In this example from some CCTS “old model to new model” consistency checking, the column references help to document the logic. The variable $old is the root of the genericode of the previous version of the CCTS model, and the variable $new is that of the current reporting version of the CCTS model. This code creates a variable of <ndrinfo> elements for all BIEs that had a data type qualification in the old model and, in the new model, either have no data type qualification or have a different one:

<xs:variable>
  <para>What QDT's are missing?</para>
</xs:variable>
<xsl:variable name="gu:qdt" as="element(ndrinfo)*">
  <xsl:for-each select="$old//Row[gu:col(.,'DataTypeQualifier')]">
    <xsl:variable name="gu:oldDEN"
                  select="gu:col(.,'DictionaryEntryName')"/>
    <xsl:variable name="gu:oldClassNameType"
                 select="concat(gu:col(.,'ObjectClass'),' ',
                                gu:col(.,$gu:names),' ',
                                gu:col(.,'ComponentType'))"/>
    <xsl:variable name="gu:oldDTQ"
                  select="gu:col(.,'DataTypeQualifier')"/>
    <xsl:variable name="gu:newClassNameType"
                  select="key('gu:bie-by-class-name-type',$gu:oldClassNameType,$new)"/>
    <xsl:if test="exists($gu:newClassNameType) and
                  not($gu:newClassNameType/gu:col(.,'DataTypeQualifier')=$gu:oldDTQ)">
      <ndrinfo gu:den="{$gu:oldDEN}" old="{$gu:oldDTQ}" gu:var="gu:qdt"
           new="{key('gu:bie-by-den',$gu:oldDEN,$new)/
                 gu:col(.,'DataTypeQualifier')}"/>
    </xsl:if>
  </xsl:for-each>
</xsl:variable>
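The shape of that comparison can be sketched outside of XSLT as well. The Python below is a simplified illustration under stated assumptions: it matches old and new rows on DictionaryEntryName alone rather than on the class, name and type combination used above, and the function names and the dictionary entry names in the sample data are invented for the example.

```python
import xml.etree.ElementTree as ET


def row_dict(row):
    """Turn one genericode row into a column-reference-to-value mapping."""
    return {v.get("ColumnRef"): v.findtext("SimpleValue")
            for v in row.findall("Value")}


def changed_qualifiers(old_root, new_root):
    """List BIEs whose old data type qualifier is absent or different now."""
    new_by_den = {}
    for row in new_root.iter("Row"):
        values = row_dict(row)
        new_by_den[values.get("DictionaryEntryName")] = values
    findings = []
    for row in old_root.iter("Row"):
        old = row_dict(row)
        old_dtq = old.get("DataTypeQualifier")
        if old_dtq is None:
            continue  # only rows that were qualified in the old model
        new = new_by_den.get(old.get("DictionaryEntryName"))
        if new is not None and new.get("DataTypeQualifier") != old_dtq:
            findings.append({"den": old["DictionaryEntryName"],
                             "old": old_dtq,
                             "new": new.get("DataTypeQualifier")})
    return findings


def table(*rows):
    """Build a hypothetical table from (DEN, qualifier-or-None) pairs."""
    root = ET.Element("SimpleCodeList")
    for den, dtq in rows:
        row = ET.SubElement(root, "Row")
        pairs = [("DictionaryEntryName", den), ("DataTypeQualifier", dtq)]
        for ref, text in pairs:
            if text is not None:  # sparse: absent columns are simply omitted
                value = ET.SubElement(row, "Value", ColumnRef=ref)
                ET.SubElement(value, "SimpleValue").text = text
    return root


# Invented dictionary entry names and qualifiers, for illustration only.
old = table(("Example. First. Code", "Currency"),
            ("Example. Second. Code", "Unit"),
            ("Example. Third. Text", None))
new = table(("Example. First. Code", "Currency"),
            ("Example. Second. Code", None),
            ("Example. Third. Text", None))

findings = changed_qualifiers(old, new)
print(findings)
```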

So in many ways genericode makes both the data and the code easy to read by a human and easy to work with. Of course elaborate coding can get around many obscurities in unlabeled columns of content of a dense table, but there are no such obscurities when working with data containing manifest associations with their semantic inference.

Such has made it easy to create a suite of XSLT-based CCTS validation and processing applications, as depicted in this data flow of the creation of validation and reporting artefacts for document models following the OASIS Business Document Naming and Design Rules:

Figure 1. Validation artefact production data flow

Note how the genericode serialization of the model from the ODF collaboration spreadsheets is the basis of the creation of all artefacts, including the ODF version of the distributed spreadsheets (the XLS version is created using OpenOffice). The collaboration spreadsheet has properties useful to the committee members and these need not be distributed to end users. Accordingly, the collaboration spreadsheet is translated into genericode (and there is a free tool on Crane’s GitHub web site with which to do so) and then the genericode is used to create the distribution artefacts (with other free tools available on Crane’s GitHub web site).

One final side note regarding genericode’s use of XML namespaces is interesting. The document element is in the genericode namespace, but all of the other elements are in no namespace. From an XPath perspective, accessing the document element typically requires using a namespace prefix, but accessing all other elements is done without a prefix. After having confirmed once in my XSLT that I am dealing with a genericode document element, I then use the “*” wild card for the document element in the rest of my XPath expressions. Thus I do not have to deal with namespace prefixes in the rest of the code.

To illustrate this, in an XML instance the genericode document element and its children read as follows; note the lack of a prefix on the children, which are in no namespace:

<gc:CodeList
    xmlns:gc="http://docs.oasis-open.org/codelist/ns/genericode/1.0/">
  <Identification>...</Identification>
  <ColumnSet>...</ColumnSet>
  <SimpleCodeList>...</SimpleCodeList>
</gc:CodeList>
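The same "check the namespace once, then forget about it" approach carries over to other XML APIs. As a minimal sketch in Python (not the XSLT I use), the standard library's ElementTree reports the document element with its namespace in braces and the children with bare names; the instance in DOC is abbreviated, with empty elements where real content would be.

```python
import xml.etree.ElementTree as ET

GC_NS = "http://docs.oasis-open.org/codelist/ns/genericode/1.0/"

# Abbreviated instance: the children are left empty for illustration.
DOC = """
<gc:CodeList xmlns:gc="http://docs.oasis-open.org/codelist/ns/genericode/1.0/">
  <Identification/>
  <ColumnSet/>
  <SimpleCodeList/>
</gc:CodeList>
"""

root = ET.fromstring(DOC)

# Confirm once that this is a genericode document element...
is_genericode = root.tag == f"{{{GC_NS}}}CodeList"

# ...after which the children are reached with plain, unprefixed names.
child_names = [child.tag for child in root]
column_set = root.find("ColumnSet")

print(is_genericode, child_names)
```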

When considering the serialization in XML of tabular information, consider the human factors of data inspection and application coding. Consider also the opportunity for labeling content within a published vocabulary using semantic association. Should your tables be sparsely-populated, consider the use of genericode … just ignore the fact that the table construct name is “code list”.