Archive for June, 2006

SPARQL Clipboard

Posted in Uncategorized on June 9th, 2006 by darcusb – Comments Off

Don’t have much to say about this new SPARQL clipboard, except cool!

Citation Formatting in Word 2007

Posted in Uncategorized on June 9th, 2006 by darcusb – 3 Comments

Yesterday I examined the encoding of citations and bibliographic data in Microsoft’s Open XML formats. Today I’d like to discuss another crucial piece of the puzzle, which is citation style configuration and formatting.

Examining the contents of an example document, I came across the following attribute: SelectedStyle=”\APA.XSL”. This naturally suggested to me they’re using XSLT to do the formatting. A quick ping to M. David Peterson confirmed it.

In general, this is a very good thing. They’re using a W3C standard technology in such a way that it ought to be possible to easily enhance it, or substitute alternate implementations. So if that’s all true, kudos to Microsoft!

Not surprisingly, though, given that I have a little experience using XSLT for these purposes, I have some thoughts/observations.

The first is that bibliographic and citation formatting is pretty complicated, and fully supporting a style like APA using XSLT 1.0 is going to be really difficult. Just take a close look at the output example from citeproc for the APA style. This is hard to do even with the much more advanced capabilities of XSLT 2.0. I’d venture to say it is impossible to fully implement in XSLT 1.0 without extensions.

Even if an XSLT expert manages to program it, it will be almost impossible for even tech-savvy users to create or edit styles in any significant way. I consider my XSLT skills strong, and I find understanding how I’d modify or implement a style really difficult. The code is really complicated.

Just as a hint of the complexity, the archive with the XSLTs (some generic processing files, as well as 10 styles) weighs in at 2.6 MB (!). The APA.XSL file is a whopping 340 KB. By contrast, the lib directory (which contains all the XSLT files) of the XSLT 2.0 version of citeproc weighs in at 584 KB. Though this doesn’t include the CSL files to configure the styles, those are each quite small (my APA style, which AFAIK fully implements the spec, is only 8 KB).

But what this does suggest to me is that it ought to be easy to swap in citeproc, or for Microsoft to port it to XSLT 1.0 if they like. The benefits to using a domain language like CSL for styling are significant. It becomes easy for users to create new styles, and for developers to create tools for it.

In other news, the XSLT gives insight into the data model, and things are a little better than I’d worried about earlier. The range of reference types, for example, is broader than those in BibTeX. OTOH, types such as “ElectronicSource” start to look quite dated. Most sources these days can be electronic, and the design should reflect that. Also, the model is indeed flat, with elements like b:JournalName.

JSTOR and Closed “Open” Data

Posted in Uncategorized on June 8th, 2006 by darcusb – Comments Off

In the WTF department, one of my library IT guys mentioned this to me yesterday. Cool? SRU access to JSTOR. Not cool? To quote the key bit:

Organizations and vendors wishing to use the JSTOR XML Gateway in conjunction with a federated search engine must sign a Metasearch Agreement.

Sigh …

Citations in “Open” XML

Posted in Uncategorized on June 8th, 2006 by darcusb – 1 Comment

For the past few years I’ve been on a mission to add rich citation support to XML document formats. I do not believe we can improve the experience for end users without it.

First step was DocBook, where Peter Flynn, Markus Hoenicka and I proposed a solution to the DocBook Technical Committee, who reviewed, then approved it, with some modifications IIRC. Recent releases of the format now include this support.

I then worked with Daniel Vogelheim (then at Sun) to adapt the DocBook citation structure to OpenDocument. We basically took the logic of the encoding approach, but added a field to render the formatted citation for display. The ODF TC subsequently approved the proposal, and it is scheduled for inclusion in ODF 1.2, due out sometime next year.

This is how open standards development works in an ideal world. Someone has an idea, it is reviewed by a group of experts, and then agreed to. Typically the idea itself builds on existing best practices. The DocBook proposal, for example, was informed by BibTeX. This open process results not only in a standard solution, but also a technically-superior one that has benefited from peer review.

The last big XML document format I was hoping to tackle was Microsoft’s. To wit, I contacted Brian Jones at Microsoft roughly a year ago suggesting they add citation support equivalent to the new ODF support to their Word XML format. I got a polite reply, but ultimately no action. Likewise, I contacted someone I knew at Apple to ask them to press on this too, given that they have signed on to the effort to standardize Microsoft’s XML formats at ECMA, and later ISO. This is too important not to do right. Again, nothing much came of it as near as I could tell.

I was therefore somewhat surprised to learn just this week that MS has added citation and bibliography support to Word 2007. So I downloaded the latest ECMA draft for their XML formats, and sure enough found information about encoding this in XML. At this point, the spec is too vague to be of much use as it leaves out the crucial information about the logic. To help me understand that, Peter Sefton sent me a copy of a docx file with a citation in it. Here, then, is what a citation looks like in the new format:

For comparison, here is the OpenDocument equivalent:

As with ODF, citations are there, and they are proper dynamic fields. The source data is stored apart from the main content file. This makes it in theory trivial to regenerate formatting in vastly different styles. If footnotes are allowed in those fields, it even makes it possible to switch between radically different styles, such as in-text, and footnote-based.
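
To make the point about regenerating formatting concrete, here is a minimal sketch in Python. It is not how Word or ODF store anything; the record keys, field markers, and the two style names are all hypothetical. It only illustrates that once the source data lives apart from the text and the citations are fields, switching between an in-text style and a note-based one is just a re-render.

```python
# Hypothetical source records, stored apart from the document text.
SOURCES = {
    "doe2006": {"author": "Doe", "year": 2006, "title": "An Example Title"},
    "smith1999": {"author": "Smith", "year": 1999, "title": "Another Title"},
}

# The document holds plain text plus citation fields that only point at keys.
DOCUMENT = [
    "Some claim ",
    {"cite": "doe2006"},
    " and a rebuttal ",
    {"cite": "smith1999"},
    ".",
]


def render(document, sources, style):
    """Regenerate the text, formatting every citation field in one style."""
    out = []
    notes = []
    for part in document:
        if isinstance(part, str):
            out.append(part)
            continue
        src = sources[part["cite"]]
        if style == "author-date":
            out.append(f"({src['author']} {src['year']})")
        elif style == "footnote":
            # The field becomes a note marker; the reference moves to a note.
            notes.append(f"{src['author']}, {src['title']} ({src['year']}).")
            out.append(f"[{len(notes)}]")
        else:
            raise ValueError(f"unknown style: {style}")
    text = "".join(out)
    if notes:
        text += "\n" + "\n".join(f"{i}. {n}" for i, n in enumerate(notes, 1))
    return text


if __name__ == "__main__":
    print(render(DOCUMENT, SOURCES, "author-date"))
    print(render(DOCUMENT, SOURCES, "footnote"))
```

Nothing in the document text changes between the two calls; only the style argument does.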

The bad? Look at line 20 in the OXML screenshot. The first problem is that this information is not included in the ECMA spec. This absolutely must be specified for there to be any hope for interoperability. The second problem is that this most critical information is not encoded in XML! Rather, they are using a series of tokens to encode the information.

Now compare this design to the ODF version, where everything is encoded in elements and attributes, and it is all standardized in the schema, and later the documentation.

So how to fix this? MS is using generic structures to encode citations, so I doubt it would be appropriate to adopt the same approach as ODF. OTOH, it seems sensible to me that they allow foreign-namespaced content in the field elements, and that the information now encoded in the w:inst attribute be handled by dedicated attributes. Include those in the ECMA spec, and the primary problem is solved.
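
A rough sketch of the difference, in Python. Both the token string and the XML below are made up for illustration (the switch names, the `cite` element, and its attributes are assumptions, not the actual OXML encoding). The point is only that the token form needs a custom mini-parser whose rules must be documented somewhere, while the attribute form falls out of any XML parser and can be described in a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical token-style instruction string (switch names are assumptions).
INSTRUCTION = r"CITATION Doe06 \p 23 \n"

# Hypothetical equivalent using dedicated attributes.
FIELD_XML = '<cite ref="Doe06" pages="23" suppress-author="true"/>'


def parse_tokens(instruction):
    """Hand-written parser: every consumer has to reimplement these rules."""
    tokens = instruction.split()
    if len(tokens) < 2 or tokens[0] != "CITATION":
        raise ValueError("not a citation instruction")
    result = {"ref": tokens[1], "pages": None, "suppress_author": False}
    i = 2
    while i < len(tokens):
        if tokens[i] == "\\p" and i + 1 < len(tokens):
            result["pages"] = tokens[i + 1]  # switch that takes a value
            i += 2
        elif tokens[i] == "\\n":
            result["suppress_author"] = True  # switch that is a bare flag
            i += 1
        else:
            raise ValueError(f"unknown or incomplete token: {tokens[i]}")
    return result


def parse_attributes(xml_text):
    """With attributes, a stock XML parser does the work."""
    el = ET.fromstring(xml_text)
    return {
        "ref": el.get("ref"),
        "pages": el.get("pages"),
        "suppress_author": el.get("suppress-author") == "true",
    }


if __name__ == "__main__":
    print(parse_tokens(INSTRUCTION))
    print(parse_attributes(FIELD_XML))
```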

There are two other problems, however, each of which can also be fixed without much pain.

The first is the data model for the source metadata.

It is really critical that MS err on the side of generality and flexibility here or users and developers will be frustrated. Some random comments:

  1. Given Brian Jones’s claim that OXML has richer support for DC than ODF, it is surprising they fail to use it here. The Extended DC isPartOf structure is very useful for capturing the relational character of bibliographic metadata, and yet it seems their model is flat.
  2. More worrying, though I need to confirm this, is that they seem to have made the same mistake the StarOffice developers made, which is to base their types on BibTeX. If that’s the case, it will be unusable by scholars and students across whole swaths of the humanities, law and the social sciences.
  3. Great to see full Person elements for contributors!
  4. Disappointing to see a highly Western (even U.S.)-centric content model for naming. Why not use vCard across the format, and gain international-friendliness for free?

Also, there’s no standard model here, and the author encoding is quite funky. Would be really nice to see if somehow MS might pick up our metadata work at ODF.
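
To illustrate the flat versus relational point, here is a small Python sketch. The field names in the flat record and the keys in the relational one are hypothetical stand-ins, not the actual Word schema or a DC serialization. In the relational form, the container is its own record linked through an isPartOf-style property, so the same code handles an article in a journal and a chapter in a book in a series.

```python
# A flat record: the container is squeezed into a type-specific field.
FLAT = {
    "Title": "An Article",
    "Author": "Doe",
    "JournalName": "Some Journal",
    "Year": "2006",
}


def to_relational(flat):
    """Lift the container into its own record, linked via isPartOf."""
    journal = {"type": "journal", "title": flat["JournalName"]}
    return {
        "type": "article",
        "title": flat["Title"],
        "creator": flat["Author"],
        "date": flat["Year"],
        "isPartOf": journal,
    }


def containers(record):
    """Walk the isPartOf chain and return container titles, nearest first."""
    chain = []
    current = record.get("isPartOf")
    while current is not None:
        chain.append(current["title"])
        current = current.get("isPartOf")
    return chain


# The same structure nests arbitrarily, with no new type-specific fields.
CHAPTER = {
    "type": "chapter",
    "title": "A Chapter",
    "isPartOf": {
        "type": "book",
        "title": "A Book",
        "isPartOf": {"type": "series", "title": "A Series"},
    },
}

if __name__ == "__main__":
    print(containers(to_relational(FLAT)))
    print(containers(CHAPTER))
```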

The second issue ties directly to the above, and that is how records are identified. Ideally, all citations should use standard URIs to identify them; ISBNs, DOIs, PubMed IDs, etc. can all be represented as standard URIs, and they should be here. It makes documents much more portable, and makes it possible to do clever things for users.
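
As a minimal sketch of what that could look like, the Python helper below maps a few identifier kinds onto URI forms: `urn:isbn:` for ISBNs, and the `info:` scheme's `doi` and `pmid` namespaces. The function name and the example values are hypothetical, stripping hyphens from ISBNs is my own normalization choice, and the sketch does not percent-encode unusual characters in DOIs.

```python
def to_uri(kind, value):
    """Represent a bibliographic identifier as a URI (illustrative only)."""
    value = value.strip()
    if kind == "isbn":
        # Drop hyphens and spaces so equivalent ISBNs compare equal.
        return "urn:isbn:" + value.replace("-", "").replace(" ", "")
    if kind == "doi":
        return "info:doi/" + value
    if kind == "pmid":
        return "info:pmid/" + value
    raise ValueError(f"unsupported identifier kind: {kind}")


if __name__ == "__main__":
    # Placeholder values, not real records.
    print(to_uri("isbn", "0-000-00000-0"))
    print(to_uri("doi", "10.1000/example"))
    print(to_uri("pmid", "12345678"))
```

With identifiers in this shape, a citation field only needs to carry one URI, and any tool that understands the scheme can resolve it to a record.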

So my suggestions for MS and for the ECMA team:

  • the ECMA spec must include details of encoding the citations
  • citation logic should not be stored as strings, but using dedicated attributes and/or elements and include:
    • a uri to link to the metadata record, with preference for using standard schemes such as urn, info, etc.
    • point locators to locate particular sections based on page, paragraph, section, etc.
    • local style (to leave off the author name, for example)
    • prefix and suffix text
  • the content of citation fields must allow footnotes and endnotes
  • The source schema should:
    • use DC and DCQ wherever appropriate
    • adopt a properly relational structure by drawing on dcq:isPartOf
    • typing should be flexible; see http://purl.org/net/biblio
  • Styling. One word: CSL. This format is well-designed (if I do say so myself!), and is a crystallization of a lot of work in figuring out all the complex requirements of real world citation style. It is exactly the kind of thing that needs to be adopted as a standard. I am perfectly willing to submit it to a standards body if that would help.

So there is a lot of promise here, but a fair bit of work to do. All of my suggestions above have real world consequences for users and developers. The solution in general: standardize these pieces—in particular citation coding and style configuration—to do what standards do, which is to set the ground for real innovation and user choice.

Next Generation Citation Support

Posted in Uncategorized on June 7th, 2006 by darcusb – Comments Off

I only learned yesterday that Word 2007 will be gaining support for citations and bibliographies, as will the XML format MS is submitting to ECMA.

Peter Sefton has a quick look at the new support, as well as some general thoughts about what we need to move forward.

I’ll discuss the low-level XML details in a separate post, but I’d like to comment on this point that Peter raises:

I think that we are going to end up with a huge mess in this area, with incompatible implementations of embedded bibliographic data from OpenDocument and Open XML, with no backwards compatibility for Word versions earlier than 2007. Some people at USQ use EndNote, but unless it gets OpenDocument support it won’t be interoperable either.

Unless someone can pull all this together in the standards committees, that is.

I think that the open source community should put effort into a more microformat-style approach. My idea is to use hyperlinks as citation markers and make a stand-alone web-enabled bibliography tool (which is where the hyperlinks would point) that can live either on your computer or on a server and can synchronize libraries. This tool would be able to format bibliographies for OpenDocument and MS Word.

I think this reflects a misunderstanding of how the new citation support will work in OpenDocument in particular. There, a citation will consist of two pieces: a source element that contains pointers to metadata records, and a body element that contains rendered citations. It is just a purpose-designed dynamic field, which is basically what Peter is asking for above. The new Word 2007 support is conceptually the same, though with one difference I’ll discuss more later. There really is the possibility now for greatly enhanced interoperability.

Where I think Peter gets nervous is that OXML and ODF both use package file structures, and in both cases bibliographic metadata can be stored within those packages as separate files. Peter’s right; I think this is a really good thing. I’m willing even to ditch backward compatibility in the interest of adopting well-designed and fully open file formats. There will be short-term pain, but promise of long-term nirvana.

Also, I forgot to add that MS will be adding the ability to open and save Open XML files to previous versions of Office as far back as Office 2000.

My view both overlaps with and diverges from Peter’s, then. My perspective is we need:

  • to strongly decouple citations, from reference storage, from formatting
  • to move citations and formatting into the document format, so they are standardized
  • something like Peter’s suggestion of standardized ids (uris) for identifying citations

All of these together will enhance innovation in the market, and make users’ lives easier.

All of this comes with caveats, though, which will be the subject of my next post.


Creative Commons License