Posts Tagged ‘XSLT’

XSLT 1.0 + EXSLT

Posted in Uncategorized on October 9th, 2006 by darcusb – Comments Off

If I (or, ahem, preferably someone else) were going to port my XSLT 2.0 version of CiteProc to 1.0, this is a hint of how to do it. As in the current version, use custom functions to do the heavy lifting, and keep the templates as clean as possible. It’s really not possible to do without support for EXSLT, though.

XSLT 2.0 Performance Tuning

Posted in Uncategorized on July 29th, 2006 by darcusb – Comments Off

Mike Kay on XSLT 2.0 performance tuning. Gotta come back to this when I have more time to tweak citeproc-xsl. Quite a few of the solutions to some really difficult problems there, BTW, were courtesy of Mike by way of the XSL list.

(X)Forms in Bibliographic Apps

Posted in General on June 15th, 2006 by darcusb – Comments Off

A while back I wrote that a new bibliographic web application ought to include:

A configurable form system flexible enough to be configured for any resource type: everything from journal articles to books, to archival documents, to weblog posts. This presumes the form system should not be based on RIS or BibTeX, but rather around a more flexible standard like MODS. Either XML or YAML would be good bets for configuration languages in Ruby or Python.

I probably mentioned the idea of using a simple XML language to configure the GUI elsewhere too. In any case, MS has done just that in Word 2007:

So it seems the entire set of editing forms is configured with this XML file. In fact, I bet (though cannot now test) that one could add custom types by simply editing this file.
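
To make the idea concrete, here is a minimal Python sketch of a form system driven by an XML configuration file. The element and attribute names (`forms`, `type`, `field`, `key`, `label`, `required`) are invented for illustration; they are not the schema Word 2007 uses, which I have not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration format: every element and attribute name below
# is made up for this sketch and is NOT Word 2007's actual schema.
FORM_CONFIG = """
<forms>
  <type name="JournalArticle">
    <field key="title" label="Title" required="true"/>
    <field key="journal" label="Journal Name" required="true"/>
    <field key="volume" label="Volume"/>
  </type>
  <type name="ArchivalDocument">
    <field key="title" label="Title" required="true"/>
    <field key="archive" label="Archive" required="true"/>
    <field key="box" label="Box"/>
  </type>
</forms>
"""


def load_form_definitions(xml_text):
    """Return {type name: [(key, label, required), ...]} from the config."""
    root = ET.fromstring(xml_text)
    forms = {}
    for type_el in root.findall("type"):
        fields = []
        for field_el in type_el.findall("field"):
            fields.append((
                field_el.get("key"),
                field_el.get("label"),
                field_el.get("required", "false") == "true",
            ))
        forms[type_el.get("name")] = fields
    return forms


def missing_required(forms, type_name, record):
    """List the required field keys that a record leaves empty."""
    return [key for key, _label, required in forms[type_name]
            if required and not record.get(key)]
```

Adding a new resource type then means adding one more `type` element to the file, with no change to the code that builds or validates the form.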

Interestingly, the author definition includes an associated XSLT that converts a simple string to properly-structured XML, and another to convert the other way (though I still hate that it all, including the XML, presumes standard Western name forms; what if I am a scholar of Chinese history?). I wonder, can you do this in XForms?
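
As a rough illustration of that round trip (in Python rather than XSLT, and not based on Microsoft's stylesheets), the sketch below parses a "Family, Given" string into a small XML structure and back. The `name`, `family` and `given` element names are my own assumptions. Keeping the parts explicit, and leaving display order to a separate function, is one way to avoid baking a given-name-first convention into the data.

```python
import xml.etree.ElementTree as ET


def name_to_xml(text):
    """Turn 'Family, Given' (or a single name) into a <name> element."""
    family, _, given = text.partition(",")
    name = ET.Element("name")  # element names are hypothetical
    ET.SubElement(name, "family").text = family.strip()
    if given.strip():
        ET.SubElement(name, "given").text = given.strip()
    return name


def xml_to_name(name):
    """Convert a <name> element back to the 'Family, Given' string form."""
    family = name.findtext("family", default="")
    given = name.findtext("given", default="")
    return f"{family}, {given}" if given else family


def display_name(name, family_first=False):
    """Render a name for display; the order is a choice, not part of the data."""
    family = name.findtext("family", default="")
    given = name.findtext("given", default="")
    if not given:
        return family
    return f"{family} {given}" if family_first else f"{given} {family}"
```

For example, `display_name(name_to_xml("Mao, Zedong"), family_first=True)` keeps the family name first, while the stored structure is the same as for any other author.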

I’ve been saying for a while that OOo needs to deepen XForms support to open it up to developers for these sorts of uses. This would be particularly interesting when coupled with the idea, which a couple of the Sun engineers were discussing at the ODF metadata SC, of creating a standard RDF XForms binding for our metadata work. That could allow GUIs to be essentially auto-configured for custom content.

Plug-in vs. Standard, XSLT vs. CSL

Posted in Uncategorized on June 13th, 2006 by darcusb – 1 Comment

Peter again on citations in Word. Two issues he raises; first about my argument that MS ought to use a CSL (or CSL-like) abstraction on top of a generic XSLT:

Bruce has some concerns about the complexity and size of the XSLT involved, but I don’t think that matters so much -what matters is that XSLT is involved. All that’s required is an CSL to XSLT compiler. Feed CSL in one end and get a Word 2007 compatible stylesheet out the other. This could be done with a stand alone tool.

That would be possible, but not very realistic. It adds further steps to setting up a new style, and as I mentioned, each style file would be very large. We need to start thinking about open citation style repositories, where a user (or even just a processing tool) can grab a new definition as needed. That is only convenient where the files are:

  • self-contained
  • small

The questions Peter asks near the end (about adding and creating new styles, repositories, etc.) will all have fairly uninspiring answers with the current approach. With CSL, not only do we have a feature-rich language that satisfies the above requirements, but one that is both language and document-format agnostic. One can use the same style files with ANY document format: Open XML, OpenDocument, DocBook, XHTML, RTF; even TeX.

The second big question Peter asks is whether citation support ought to be standard in Word (or OpenOffice for that matter).

And I’m still dubious about the value of having the bibliographic software built into Word 2007; Microsoft’s site clearly states that if you load a file with citations in it into an earlier version of Word they will be converted to plain text. This means that the feature will not be usable in a real-world context for several years. People have to collaborate with others, work from home and in internet cafes; we can’t mandate Word 2007 in all those places.

First, I think MS can do better than convert the citations to text. I suggest that with their patch to add OXML support to previous versions of Office, they include at least basic support to preserve the new citation logic, and perhaps a separate plug-in that provides basic GUI support that would allow compatibility with Word 2007.

I cannot emphasize enough how important it is that this stuff be standardized within document formats and included within editing applications. It’s critical, and the sad state of the current market is a direct consequence of the fact that it is not. So I’d emphasize again that I think there’s tremendous promise in this approach, and that it is just in need of some refinement.

Citation Formatting in Word 2007

Posted in Uncategorized on June 9th, 2006 by darcusb – 3 Comments

Yesterday I examined the encoding of citations and bibliographic data in Microsoft’s Open XML formats. Today I’d like to discuss another crucial piece of the puzzle, which is citation style configuration and formatting.

Examining the contents of an example document, I came across the following attribute: SelectedStyle="\APA.XSL". This naturally suggested to me they’re using XSLT to do the formatting. A quick ping to M. David Peterson confirmed it.

In general, this is a very good thing. They’re using a W3C standard technology in such a way that it ought to be possible to easily enhance it, or substitute alternate implementations. So if that’s all true, kudos to Microsoft!

Not surprisingly, though, given that I have a little experience using XSLT for these purposes, I have some thoughts/observations.

The first is that bibliographic and citation formatting is pretty complicated, and fully supporting a style like APA using XSLT 1.0 is going to be really difficult. Just take a close look at the output example from citeproc for the APA style. This is hard to do even with the much more advanced capabilities of XSLT 2.0. I’d venture to say it is impossible to fully implement in XSLT 1.0 without extensions.

Even if an XSLT expert manages to program it, it will be almost impossible for even tech-savvy users to create or edit styles in any significant way. I consider my XSLT skills strong, and I find understanding how I’d modify or implement a style really difficult. The code is really complicated.

Just as a hint of the complexity, the archive with the XSLTs (some generic processing files, as well as 10 styles) weighs in at 2.6 MB (!). The APA.XSL file is a whopping 340 KB. By contrast, the lib directory (which contains all the XSLT files) of the XSLT 2.0 version of citeproc weighs in at 584 KB. Though this doesn’t include the CSL files to configure the styles, those are each quite small (my APA style, which AFAIK fully implements the spec, is only 8 KB).

But what this does suggest to me is that it ought to be easy to swap in citeproc, or for Microsoft to port it to XSLT 1.0 if they like. The benefits to using a domain language like CSL for styling are significant. It becomes easy for users to create new styles, and for developers to create tools for it.

In other news, the XSLT gives insight into the data model, and things are a little better than I’d worried about earlier. The range of reference types, for example, is broader than those in BibTeX. OTOH, types such as “ElectronicSource” start to look quite dated. Most sources these days can be electronic, and the design should reflect that. Also, the model is indeed flat, with elements like b:JournalName.

XML Office Taste Test?

Posted in Uncategorized on November 29th, 2005 by darcusb – Comments Off

The Groklaw article seems to have sparked—or at least coincided with—some debate about the technical details of ODF vs. MS XML. One example of that debate is here. As I said in a comment there, I contributed to that article because I was tired of the “but OpenDocument is a nice little format, but not good enough for our needs” argument I keep hearing out of Redmond. I believe it perfectly possible to argue not only that ODF is equal to MS’s format, but superior.

I suppose in considering technical merit you have to have some benchmarks. Top among mine is how easy each format is to transform using standard XML tools like XSLT.

Here might be a good test:

Choose ten random programmers who claim to have XSLT skills. Confirm that at least some of them would consider their skills to be modest; XSLT beginners if you will.

Now, give them 48 hours to write two stylesheets for each format. One that converts from the format to XHTML, and another that converts from that XHTML back to the format. The document in question would be non-trivial; including a variety of different paragraph types, inline styling, footnotes, sections, and images.

Now, compare the quality of results.

My contention is that ODF would win this challenge by a significant margin. Put simply, you will end up with more consistent and better results.

BTW, Dorothea Salo makes a good point that we failed to address: the issue of overlapping tags and well-formedness. Am not sure how big a practical problem it is with Word’s XML.

Web Services and Distributed Citation Processing

Posted in Uncategorized on July 30th, 2005 by darcusb – Comments Off

One of the ideas I stumbled on when writing CiteProc, my XSLT-based citation processor, is that citation processing can be totally decoupled from metadata storage. A simple example of this is how I processed my recently completed book: by letting Saxon query an eXist XML DB over HTTP and using the returned MODS metadata to format the citations on-the-fly.

That was great because I didn’t have to write any code but the XSLT, and it worked! But things start to get more interesting when you think beyond this fairly simple model. Consider two examples that came out of collaborations with other projects:

In the first, Matthew Dovey at Oxford put together a simple web service that takes four parameters: document url, data store type (eXist’s XQuery-over-HTTP or SRU), data store url, and citation style. Here’s an example, where the document is on one server, and the bibliographic metadata is stored on another.
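
To show the shape of such a call, here is a small Python sketch that assembles a request from the four parameters. The endpoint and the parameter names (`document`, `storeType`, `storeUrl`, `style`) are placeholders of mine, not the actual interface of Matthew's service.

```python
from urllib.parse import urlencode

# The two data store types mentioned in the post; the string values used to
# name them here are assumptions.
STORE_TYPES = ("xquery", "sru")


def build_request_url(service_url, document_url, store_type, store_url, style):
    """Build a GET URL carrying the four parameters the service needs."""
    if store_type not in STORE_TYPES:
        raise ValueError(f"unknown data store type: {store_type!r}")
    query = urlencode({
        "document": document_url,   # where the document with citations lives
        "storeType": store_type,    # how to talk to the metadata store
        "storeUrl": store_url,      # where the bibliographic metadata lives
        "style": style,             # which citation style to apply
    })
    return f"{service_url}?{query}"


url = build_request_url(
    "http://citeproc.example.org/format",   # hypothetical service
    "http://docs.example.org/chapter1.xml",  # document on one server
    "sru",
    "http://library.example.net/sru",        # metadata on another
    "apa",
)
```

The point is only that the document, the metadata and the formatter can each sit on a different machine, tied together by nothing more than URLs.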

The second example is similar, and a demo is included in the CiteProc release archive. If you run the docbook-test-sru-refbase.xml example with the refbase-xhtml.xsl stylesheet, the processor (Saxon for now) will extract the citations, construct an SRU query, which it issues to a test server in Germany somewhere, returning the corresponding MODS records and formatting them, once again, on-the-fly.
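
The "construct an SRU query" step could look roughly like this in Python. The request parameters (`operation`, `version`, `query`, `recordSchema`, `maximumRecords`) are standard SRU searchRetrieve parameters; the base URL, the `rec.identifier` index and the `mods` schema name are assumptions, and the real test server may well expect something different.

```python
from urllib.parse import urlencode


def cql_for_citations(keys, index="rec.identifier"):
    """Build a CQL query matching any of the given citation keys.

    The index name is an assumption; a given SRU server may expose a
    different index for record identifiers.
    """
    terms = []
    for key in keys:
        # Escape backslashes and double quotes inside the quoted CQL term.
        escaped = key.replace("\\", "\\\\").replace('"', '\\"')
        terms.append(f'{index} = "{escaped}"')
    return " or ".join(terms)


def sru_search_url(base_url, keys, record_schema="mods", version="1.1"):
    """Return an SRU searchRetrieve URL asking for one record per key."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_for_citations(keys),
        "recordSchema": record_schema,
        "maximumRecords": str(len(keys)),
    }
    return base_url + "?" + urlencode(params)
```

Calling `sru_search_url("http://sru.example.org/search", ["doe1999", "smith2000"])` yields a single URL whose response, if the server cooperates, would be the MODS records to hand to the formatter.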

OK, this is starting to look very cool, and very useful. All of a sudden we have an easy, standards-based path to interoperability!

But given that I’ve been thinking about RDF lately again, I’m imagining extending this further. A simple solution would be a web service that could take a list of references, query distributed RDF stores, and return a collection of MODS records for processing. A more radical solution might be to use, say, a SPARQL XSLT extension to work with the RDF directly from within XSLT.

In either case, my hunch is that there’s a lot of possibility in this idea, and that the old notion of every user having to store and manage their own citation metadata—or conversely that it all ought to be stored on a centralized server—is one that is seriously holding back innovation in this space. Why do I even have to maintain my own citation metadata anyway?

[ANN] XBib

Posted in Uncategorized on June 21st, 2005 by darcusb – 2 Comments

Time to finally announce the stuff I’ve been working on …

XBib provides important building blocks for dramatically improved bibliographic and citation support in XML. The project consists of three key pieces:

  1. Cite: a small namespaced schema for marking up citations in XML; recently approved for inclusion in the OpenOffice file format, it is suitable for embedding in other document formats, including WordML.
  2. Citation Style Language: an XML language for specifying citation and bibliographic formatting. Similar in principle to BibTeX .bst files or the binary style files in commercial products like Endnote or Reference Manager, this styling language has the distinction of being open, easy-to-use, and feature-rich.
  3. CiteProc: a first implementation of a CSL citation-processing engine, implemented using XSLT 2.0. The stylesheets can interact with a data store over HTTP using either XQuery or SRU. Initial supported input formats are DocBook NG and MODS, an XML schema from the Library of Congress. Initial output formats include XHTML and LaTeX, but the driver architecture makes it trivial to add support for other formats. Similarly, it should be fairly easy to port CiteProc to other languages.
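
The driver idea in the third item can be sketched in a few lines of Python. This is not how CiteProc's XSLT is organized; it only illustrates the principle that the formatter emits role-tagged pieces and a per-format driver decides the markup. The role names and driver functions are hypothetical, and the LaTeX driver skips escaping of special characters for brevity.

```python
from html import escape


def xhtml_driver(role, text):
    """Render one formatted piece as XHTML."""
    if role == "title":
        return f"<em>{escape(text)}</em>"
    return escape(text)


def latex_driver(role, text):
    """Render one formatted piece as LaTeX (no escaping of specials here)."""
    if role == "title":
        return "\\emph{" + text + "}"
    return text


# Supporting another output format means registering one more function.
DRIVERS = {"xhtml": xhtml_driver, "latex": latex_driver}


def render(parts, output):
    """Join role-tagged pieces of a formatted reference using one driver."""
    driver = DRIVERS[output]
    return "".join(driver(role, text) for role, text in parts)


reference = [("text", "Doe, J. (1999). "), ("title", "A Book"), ("text", ".")]
```

Here `render(reference, "xhtml")` and `render(reference, "latex")` produce the same reference in two markups without the style logic knowing about either.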

The goals of the XBib project are in some sense quite modest. It is not to create complete bibliographic applications. Instead, the focus is on key tools and standards that are needed to push the state-of-the-art on a rather neglected but essential aspect of scholarly needs: citation and bibliographic formatting. By narrowing the focus on these issues, the hope is it will be easier for other projects to better address these needs with minimal work.

On the other hand, the goals are quite ambitious indeed. XBib provides a common framework for formatting bibliographies and citations across markup languages and document standards. In an ideal world, one could use the same CSL files to format DocBook, TEI, OpenOffice, WordML … or even LaTeX documents.

With this announcement, pre-release versions of CSL and CiteProc are available for download. The Cite schema will be published at a later date once it is stable.

XSLT 2.0 and CSS Parsing

Posted in Uncategorized on May 12th, 2005 by darcusb – Comments Off

Here is a nice example of XSLT 2.0’s regular expression support for parsing CSS.
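
For my own notes, the same kind of parsing looks like this in Python. It is an analogue, not a transcription of the linked XSLT 2.0 example, and it only handles simple `property: value` declarations (no semicolons inside quoted strings or `url(...)`).

```python
import re

# One declaration: a property name, a colon, then a value running up to the
# next semicolon or the end of the string.
DECLARATION = re.compile(r"\s*([-\w]+)\s*:\s*([^;]+?)\s*(?:;|$)")


def parse_declarations(css_text):
    """Parse 'prop: value; prop: value' into a dict keyed by property name."""
    return {match.group(1).lower(): match.group(2)
            for match in DECLARATION.finditer(css_text)}


styles = parse_declarations("font-weight: bold; margin: 0 1em ;color:#333")
```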

[sometimes I use this blog for note-taking!]

XSLT and Blogging

Posted in Uncategorized on May 11th, 2005 by darcusb – Comments Off

Dave Pawson’s call to the XSLT community to create an open source weblog tool based on XSLT seems to have drawn some interest (see here and here for example).

I like the idea, with one caveat: such a project ought to be an opportunity to show the unique power that XML and XSLT can bring to the blogging mix; not just another Moveable Type clone.

Syncato seems like a nice place to start for a model, but the world has moved along since Syncato was released. Berkeley DB XML now has XQuery support, and AJAX has taken web design by storm, showing just how good web GUIs can be.

And I still would like to see someone, somewhere, try to implement a content publishing system that was not quite so constraining as a blog; something that blurred the boundaries between blogs and wikis.


Creative Commons License