Flat vs. Relational

Now that I’ve covered most of the details of citations and bibliographies in Word 2007, let me return to the subject of the source format. The team that designed the schema made a number of design decisions. In designing the equivalent for use in OpenOffice and OpenDocument, I have made some different ones. Similar debates have accompanied the effort to put together an hCite micro-format. Let’s compare.

The structure of the bib schema in Office 2007 (and, Brian Jones tells me, in Open XML; it will be documented in the ECMA format) is a flat model, with a root of b:Sources and primary child elements of b:Source. Typing is provided by a b:SourceType element. All properties of the bibliographic item are then described with child elements of b:Source; there is no hierarchy. So, for example, to encode titles:

  • for a Book, you use b:Title
  • for a BookSection, you use b:Title for the chapter title, but b:BookTitle for the container
  • for the journal article title, as above you use b:Title
  • for the journal title, you use b:JournalTitle
  • etc., etc.

The problem with this approach is that you end up with an explosion of elements to describe the range of resources. I count 9 elements that are used to describe the same thing: titles (though currently they incorrectly treat a Case “Reporter” as a contributor, when it is really a periodical title). And they are missing a few: CollectionTitle and SeriesTitle are the obvious ones. Essentially, every new resource type, particularly if it has some part-container relation, needs a new title structure! And every time you add a new title structure, you have to update code elsewhere (in, for example, every single XSLT file that implements your citation styles!).
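To make the maintenance cost concrete, here is a minimal Python sketch of what any consumer of the flat format ends up writing. The element names follow the ones mentioned above; the namespace URI, the sample records, and the `container_title` helper are my assumptions for illustration, not anything taken from the Office documentation.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the b: prefix; substitute the one your files declare.
B = "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
NS = {"b": B}

# Made-up sample data in the flat style described above.
FLAT_RECORDS = f"""
<b:Sources xmlns:b="{B}">
  <b:Source>
    <b:SourceType>BookSection</b:SourceType>
    <b:Title>A Chapter</b:Title>
    <b:BookTitle>An Edited Volume</b:BookTitle>
  </b:Source>
  <b:Source>
    <b:SourceType>JournalArticle</b:SourceType>
    <b:Title>An Article</b:Title>
    <b:JournalTitle>A Journal</b:JournalTitle>
  </b:Source>
</b:Sources>
"""

# Illustrative and incomplete: every source type with a container needs its
# own entry here, and the same table is repeated in every stylesheet.
CONTAINER_TITLE_ELEMENT = {
    "BookSection": "b:BookTitle",
    "JournalArticle": "b:JournalTitle",
}


def container_title(source):
    """Return the container's title, or None if this type has no known container."""
    source_type = source.findtext("b:SourceType", namespaces=NS)
    element_name = CONTAINER_TITLE_ELEMENT.get(source_type)
    if element_name is None:
        return None
    return source.findtext(element_name, namespaces=NS)


root = ET.fromstring(FLAT_RECORDS)
titles = [container_title(s) for s in root.findall("b:Source", NS)]
print(titles)  # ['An Edited Volume', 'A Journal']
```

Supporting a hypothetical CollectionTitle or SeriesTitle would mean another row in that table, in every place the table lives.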

Also, the modeling is inconsistent, both internally and with respect to the document-level metadata description in OXML. On the former, a simple example: the title of a book is b:Title, except when you are describing a section within the book, at which point it is b:BookTitle. On the latter, OXML now uses Dublin Core (DC) to describe documents, but there is no evidence of DC here.

There’s another problem, incidentally, with the structure of the MS schema, which is more a limitation of the validation technology they are using (XML Schema) than anything else. Because the same b:Source element is used for all types, the schema cannot validate the content by type. So it will be possible, for example, to include a b:BookTitle element within a journal article record. RELAX NG has no such limitation, but the schema isn’t expressed in RELAX NG.
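Since the schema itself cannot catch this, the check has to live in application code. The following is a rough Python sketch of such a check. The `ALLOWED_TITLES` rules and the `misplaced_titles` helper are hypothetical, and the namespace URI is again an assumption; the point is only to show the kind of per-type constraint the flat schema leaves unchecked.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the b: prefix.
B = "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
NS = {"b": B}

# Hypothetical per-type rules; the real schema states no such constraints.
ALLOWED_TITLES = {
    "Book": {"Title"},
    "BookSection": {"Title", "BookTitle"},
    "JournalArticle": {"Title", "JournalTitle"},
}
ALL_TITLE_ELEMENTS = set().union(*ALLOWED_TITLES.values())


def misplaced_titles(source):
    """List title elements that do not belong to the record's source type."""
    source_type = source.findtext("b:SourceType", namespaces=NS)
    # Unknown types allow nothing, so every title element is reported.
    allowed = ALLOWED_TITLES.get(source_type, set())
    found = []
    for child in source:
        local_name = child.tag.split("}", 1)[-1]  # strip the "{namespace}" part
        if local_name in ALL_TITLE_ELEMENTS and local_name not in allowed:
            found.append(local_name)
    return found


# A journal article record that wrongly carries a b:BookTitle.
record = ET.fromstring(
    f'<b:Source xmlns:b="{B}">'
    "<b:SourceType>JournalArticle</b:SourceType>"
    "<b:Title>An Article</b:Title>"
    "<b:BookTitle>Not a Book</b:BookTitle>"
    "</b:Source>"
)
print(misplaced_titles(record))  # ['BookTitle']
```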

My approach, by contrast, is not flat, but relational. I use RDF for the relational modeling and linking. In the XML, I use typed nodes to encode the important information, which means one need only have two title structures: title and shortTitle. Conceptually, then, you end up with:

Article
   title
   isPartOf
      Journal
         title

And the majority of critical properties can be represented with standard DC and Extended DC; the same ones, incidentally, OXML already supports for the document!
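As a rough illustration of the typed-node idea, the Python sketch below reads titles from a nested record. The `dc` and `dcterms` namespace URIs are the standard Dublin Core ones; the `bib` vocabulary, its `Article` and `Journal` element names, and the sample record are hypothetical stand-ins, not my actual schema.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "bib": "http://example.org/bib#",  # hypothetical vocabulary for the typed nodes
}

# Made-up sample record: an article that is part of a journal.
RELATIONAL_RECORD = """
<bib:Article xmlns:bib="http://example.org/bib#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>An Article</dc:title>
  <dcterms:isPartOf>
    <bib:Journal>
      <dc:title>A Journal</dc:title>
    </bib:Journal>
  </dcterms:isPartOf>
</bib:Article>
"""


def title_chain(node):
    """Return the node's title followed by the titles of its containers."""
    titles = []
    while node is not None:
        titles.append(node.findtext("dc:title", namespaces=NS))
        # In this sketch, isPartOf wraps exactly one typed container node.
        node = node.find("dcterms:isPartOf/*", NS)
    return titles


article = ET.fromstring(RELATIONAL_RECORD)
print(title_chain(article))  # ['An Article', 'A Journal']
```

Note that `title_chain` never looks at the node type: the same lookup works whether the container is a journal, a book, or a series.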

Finally, an XML schema (expressed in RELAX NG) tightly controls the structure of the content by type.

More broadly, using a relational structure in which you keep the number of properties to a minimum has further benefits. The formatting system, for example, can be made much more robust.
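One way to see the benefit for formatting is the following minimal sketch, assuming a `Resource` class with only a title and an optional container. The class, the `format_titles` function, and the "In:" output pattern are all invented for illustration and do not correspond to any real citation style.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """Hypothetical relational record: one title property, one container link."""
    kind: str
    title: str
    is_part_of: Optional["Resource"] = None


def format_titles(resource: Resource) -> str:
    """Join a resource's title with those of its containers, innermost first."""
    parts = []
    current = resource
    while current is not None:
        parts.append(current.title)
        current = current.is_part_of
    return ". In: ".join(parts)


# A three-level chain; no type-specific title property is needed anywhere.
series = Resource("Series", "A Series")
book = Resource("Book", "An Edited Volume", is_part_of=series)
chapter = Resource("Chapter", "A Chapter", is_part_of=book)

print(format_titles(chapter))  # A Chapter. In: An Edited Volume. In: A Series
```

Adding a new resource type here means adding data, not touching the formatter.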


