Citations in “Open” XML

For the past few years I’ve been on a mission to add rich citation support to XML document formats. I do not believe we can improve the experience for end users without it.

The first step was DocBook, where Peter Flynn, Markus Hoenicka and I proposed a solution to the DocBook Technical Committee, which reviewed and then approved it, with some modifications IIRC. Recent releases of the format now include this support.

I then worked with Daniel Vogelheim (then at Sun) to adapt the DocBook citation structure to OpenDocument. We basically took the logic of the encoding approach, but added a field to render the formatted citation for display. The ODF TC subsequently approved the proposal, and it is scheduled for inclusion in ODF 1.2, due out sometime next year.

This is how open standards development works in an ideal world. Someone has an idea, it is reviewed by a group of experts, and then agreed to. Typically the idea itself builds on existing best practices. The DocBook proposal, for example, was informed by BibTeX. This open process results not only in a standard solution, but also a technically-superior one that has benefited from peer review.

The last big XML document format I was hoping to tackle was Microsoft’s. To that end, I contacted Brian Jones at Microsoft roughly a year ago suggesting they add citation support equivalent to the new ODF support to their Word XML format. I got a polite reply, but ultimately no action. Likewise, I contacted someone I knew at Apple to ask them to press on this too, given that they have signed on to the effort to standardize Microsoft’s XML formats at ECMA, and later ISO. This is too important not to do right. Again, nothing much came of it as near as I could tell.

I was therefore somewhat surprised to learn just this week that MS has added citation and bibliography support to Word 2007. So I downloaded the latest ECMA draft for their XML formats, and sure enough found information about encoding this in XML. At this point, the spec is too vague to be of much use, as it leaves out the crucial information about the logic. To help me understand that, Peter Sefton sent me a copy of a docx file with a citation in it. Here, then, is what a citation looks like in the new format:

For comparison, here is the OpenDocument equivalent:

As with ODF, citations are there, and they are proper dynamic fields. The source data is stored apart from the main content file. This makes it in theory trivial to regenerate formatting in vastly different styles. If footnotes are allowed in those fields, it even makes it possible to switch between radically different styles, such as in-text, and footnote-based.

The bad? Look at line 20 in the OXML screenshot. The first problem is that this information is not included in the ECMA spec. This absolutely must be specified for there to be any hope for interoperability. The second problem is that this most critical information is not encoded in XML! Rather, they are using a series of tokens to encode the information.

Now compare this design to the ODF version, where everything is encoded in elements and attributes, and it is all standardized in the schema, and later the documentation.

So how to fix this? MS is using generic structures to encode citations, so I doubt it would be appropriate to adopt the same approach as ODF. OTOH, it seems sensible to me that they allow foreign-namespaced content in the field elements, and that the information now encoded in the w:inst attribute be handled by dedicated attributes. Include those in the ECMA spec, and the primary problem is solved.
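To make the difference concrete, here is a minimal sketch of what a consumer has to do with a token string today, and what it would get for free from dedicated attributes. The instruction string, the `CITATION` keyword position, and the switch names (`\p`, `\f`, `\s`, `\n`) are all assumptions for illustration; the real token grammar is exactly the part missing from the ECMA draft.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """The structured form that dedicated attributes would express directly."""
    source_key: str
    pages: Optional[str] = None
    prefix: Optional[str] = None
    suffix: Optional[str] = None
    suppress_author: bool = False


# Hypothetical switches that take a value, mapped to Citation fields.
VALUE_SWITCHES = {r"\p": "pages", r"\f": "prefix", r"\s": "suffix"}
# Hypothetical flag switch: leave off the author name.
SUPPRESS_AUTHOR = r"\n"


def parse_instruction(instruction: str) -> Citation:
    # Split on whitespace, keeping double-quoted runs together.
    tokens = [t.strip('"') for t in re.findall(r'"[^"]*"|\S+', instruction)]
    if len(tokens) < 2 or tokens[0] != "CITATION":
        raise ValueError(f"not a citation instruction: {instruction!r}")
    citation = Citation(source_key=tokens[1])
    i = 2
    while i < len(tokens):
        switch = tokens[i]
        if switch == SUPPRESS_AUTHOR:
            citation.suppress_author = True
            i += 1
        elif switch in VALUE_SWITCHES and i + 1 < len(tokens):
            setattr(citation, VALUE_SWITCHES[switch], tokens[i + 1])
            i += 2
        else:
            raise ValueError(f"unrecognized or incomplete switch: {switch!r}")
    return citation


example = parse_instruction(r'CITATION Doe05 \p 23-25 \n \f "see "')
print(example)
```

Every implementer has to write and debug a mini-parser like this, and guess at the grammar while doing so. With attributes defined in the schema, any XML toolkit hands back the same fields with no custom parsing at all.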

There are two other problems, however, each of which can also be fixed without much pain.

The first is the data model for the source metadata.

It is really critical that MS err on the side of generality and flexibility here or users and developers will be frustrated. Some random comments:

  1. Given Brian Jones’s claim that OXML has richer support for DC than ODF, it is surprising they fail to use it here. The Extended DC isPartOf structure is very useful for capturing the relational character of bibliographic metadata, and yet it seems their model is flat.
  2. More worrying, though I need to confirm this, is that they seem to have made the same mistake the StarOffice developers made, which is to base their types on BibTeX. If that’s the case, it will be unusable by scholars and students across whole swaths of the humanities, law and the social sciences.
  3. Great to see full Person elements for contributors!
  4. Disappointing to see a highly Western (even U.S.)-centric content model for naming. Why not use vCard across the format, and gain international-friendliness for free?

Also, there’s no standard model here, and the author encoding is quite funky. Would be really nice to see if somehow MS might pick up our metadata work at ODF.
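As a rough illustration of what I mean by relational rather than flat, here is a small sketch in which each source may point to the thing it is part of. The `Source` class, its field names, and the sample titles are all made up for the example; they are not taken from the OXML or ODF schemas.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Source:
    type: str
    title: str
    # Mirrors the isPartOf idea: a link to the containing source, if any.
    is_part_of: Optional["Source"] = None


def container_chain(source: Source) -> List[str]:
    """Return the titles of every container, innermost first."""
    titles = []
    current = source.is_part_of
    while current is not None:
        titles.append(current.title)
        current = current.is_part_of
    return titles


# Hypothetical records: an article in an issue of a journal.
journal = Source(type="journal", title="Example Journal of Law")
issue = Source(type="issue", title="Volume 12, Issue 3", is_part_of=journal)
article = Source(type="article", title="An Example Article", is_part_of=issue)

print(container_chain(article))
```

A flat record has to anticipate every such level with its own field (journal title, book title, series title, and so on), which is where BibTeX-style typing starts to break down for legal and archival sources. With the relational form, a new kind of container is just another record.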

The second issue ties directly to the above: how records are identified. Ideally, all citations should use standard URIs to identify them; ISBNs, DOIs, PubMed IDs, etc. can all be represented as standard URIs, and they should be here. It makes documents much more portable, and makes it possible to do clever things for users.
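A minimal sketch of the idea, assuming the `urn:isbn:` namespace and the `info:doi/` and `info:pmid/` forms of the info scheme. The function name and the sample identifier values are mine, and the sketch skips validation and percent-encoding that a real implementation would need.

```python
def to_uri(kind: str, value: str) -> str:
    """Map a common bibliographic identifier to a standard URI form."""
    cleaned = value.strip()
    if kind == "isbn":
        # Dropping hyphens and spaces is a normalization choice made here
        # so that the same ISBN always yields the same URI.
        return "urn:isbn:" + cleaned.replace("-", "").replace(" ", "")
    if kind == "doi":
        return "info:doi/" + cleaned
    if kind == "pmid":
        return "info:pmid/" + cleaned
    raise ValueError(f"unsupported identifier kind: {kind!r}")


# Sample values for illustration only.
print(to_uri("isbn", "0-395-36341-1"))
print(to_uri("doi", "10.1000/182"))
print(to_uri("pmid", "12345678"))
```

Once every record carries a URI like this, two documents citing the same book can be recognized as doing so without comparing titles or author strings.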

So my suggestions for MS and for the ECMA team:

  • the ECMA spec must include details of encoding the citations
  • citation logic should not be stored as strings, but using dedicated attributes and/or elements and include:
    • a URI to link to the metadata record, with preference for using standard schemes such as urn, info, etc.
    • point locators to locate particular sections based on page, paragraph, section, etc.
    • local style (to leave off the author name, for example)
    • prefix and suffix text
  • the content of citation fields must allow footnotes and endnotes
  • The source schema should:
    • use DC and DCQ wherever appropriate
    • adopt a properly relational structure by drawing on dcq:isPartOf
    • keep typing flexible; see http://purl.org/net/biblio
  • Styling. One word: CSL. This format is well-designed (if I do say so myself!), and is a crystallization of a lot of work in figuring out all the complex requirements of real world citation style. It is exactly the kind of thing that needs to be adopted as a standard. I am perfectly willing to submit it to a standards body if that would help.

So there is a lot of promise here, but a fair bit of work to do. All of my suggestions above have real world consequences for users and developers. The solution in general: standardize these pieces—in particular citation coding and style configuration—to do what standards do, which is to set the ground for real innovation and user choice.



Creative Commons License