Posts Tagged ‘XML’

MakeCSL

Posted in Uncategorized on October 6th, 2006 by darcusb – Comments Off

Recent enhancements in CSL have been a collaborative effort between the members of the XBib and Zotero projects. Zotero developer Simon Kornblith helped a lot to refine the language to make it more powerful and more consistent.

The obvious question, now that there’s finally a shipping product that uses CSL, is: what next? To my mind, we need a nice 21st century analog to the BibTeX tool makebst. Makebst was a script that prompted users with questions and then created a finished BibTeX bst style file. But the interface wasn’t exactly elegant, and the styles weren’t easily distributed.

It’s been my goal all along that we exploit the web and its network effects to quickly build up a rich collection of citation styles, which can in turn be implemented by any interested project.

So what would MakeCSL look like? It would be a web 2.0 app: nice clean AJAX-ified GUI, smart category features, including category-based feeds to keep updated on the latest styles. To create a new style, the user would find the closest existing style, click a “create new style based on” link, and then be guided through a wizard-like collection of questions, with output preview. That style would then be instantly available for other users.

Now, we just need to figure out how best to build it, and where to host a repository, and so forth.

Zotero Team Interview

Posted in Uncategorized on October 3rd, 2006 by darcusb – Comments Off

Dan Chudnov has the latest in his Library Geeks podcasts; this time an interview with the Zotero team. They talk about some of the history of their work, as well as the technical design of Zotero. Oh, and they mention CSL, which they are using for citation formatting.

Semantic Documents

Posted in Uncategorized on September 16th, 2006 by darcusb – Comments Off

Discussing data and metadata lately with respect to ODF, I got to thinking: what if we had document models where metadata was baked in from the beginning?

Florian Reuter has already noticed that from a programmatic standpoint, there’s really no significant difference between metadata and styles: they’re both collections of properties that relate to content.

So let’s forget about document-authoring tools as we know them, with their awkward and complex UIs and ultimately fairly dumb content. What if instead we just had a model that had containers and text, and container objects had two array attributes: metadata and content?

So forget about bold and italic and headings. Imagine instead you add a section to a document, and you can add any metadata property you want to it: title, date, status, etc. The editor automatically adds the rendered content to your document, but behind the scenes it remains accessible as separate metadata.

Now generalize this to everything: tables, paragraphs, quotes, citations etc.

Hmm … maybe where the next step in semantic wikis ought to lie?

… am thinking about something like this:

p = Paragraph.new
p.add_content("Hello world, here is ")

quote = Quote.new(content: "some quote")
quote.add_property(predicate: "dc:source", object: "urn:isbn:23450934")
quote.add_property(predicate: "b:sourcepages", object: "23")
quote.add_property(predicate: "dc:subject", object: "http://ex.net/subjects#random_thing")

p.add_content(quote)

p.add_content(", and the end of the sentence.")

s = Container.new
s.add_property(predicate: "dc:title", object: "Title")
s.add_content(p)

… no need for section heads, or even citations (we’re talking true “smart quotes” here!); it could be automatically rendered for different output.
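To make the idea a little more concrete, here is a minimal sketch of the container model, written in Python rather than the Ruby-flavored pseudocode above. The class layout, the `get` helper, and the `render_text` function are all assumptions made for illustration; only the two-list design (metadata and content) and the property names come from the post.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A node with two lists: metadata (predicate, object) pairs and content."""
    metadata: list = field(default_factory=list)
    content: list = field(default_factory=list)

    def add_property(self, predicate, obj):
        self.metadata.append((predicate, obj))

    def add_content(self, item):
        # item is either a plain string or another Container
        self.content.append(item)

    def get(self, predicate, default=""):
        # Return the first object recorded for this predicate.
        return next((o for p, o in self.metadata if p == predicate), default)


class Paragraph(Container):
    pass


class Quote(Container):
    pass


def render_text(node):
    """Hypothetical plain-text renderer; other outputs would walk the same tree."""
    if isinstance(node, str):
        return node
    body = "".join(render_text(child) for child in node.content)
    if isinstance(node, Quote):
        # The citation is derived from metadata, not typed by the author.
        return f'"{body}" ({node.get("dc:source")}, p. {node.get("b:sourcepages")})'
    title = node.get("dc:title")
    return f"{title}\n\n{body}" if title else body


quote = Quote()
quote.add_content("some quote")
quote.add_property("dc:source", "urn:isbn:23450934")
quote.add_property("b:sourcepages", "23")
quote.add_property("dc:subject", "http://ex.net/subjects#random_thing")

p = Paragraph()
p.add_content("Hello world, here is ")
p.add_content(quote)
p.add_content(", and the end of the sentence.")

s = Container()
s.add_property("dc:title", "Title")
s.add_content(p)

print(render_text(s))
```

The point of the sketch is that the heading and the citation never appear in the content list: both are produced at render time from metadata, so a different renderer could format them differently without touching the document.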

Apple and OpenDocument

Posted in Uncategorized on September 16th, 2006 by darcusb – Comments Off

I’d heard something about Apple supporting OpenDocument in their forthcoming version of Mac OS X, code-named Leopard, but chose not to say anything about it until it had been more widely reported. I guess a blog post counts.

It remains to be seen just how good and comprehensive Apple’s new ODF support is, and I really don’t know, but for sake of argument, let’s say:

  • it’s quite good
  • supports import/export
  • is baked into Cocoa, and so available to any OS X application developer
  • Apple productivity applications like Keynote and Pages use the support to read and write the format

And let’s imagine further that they do something similar with the ECMA Open XML spec.

I doubt reality quite matches the above, but just imagine the impact.

Now, we just need to get Apple to add standard citation encoding to the text object too, and all of a sudden I have a compelling reason to be excited about Apple again.

Feedback to TC45

Posted in Uncategorized on September 5th, 2006 by darcusb – Comments Off

Following is the feedback I just sent to the ECMA TC45, which is overseeing Open XML.

I’ve posted my analysis of the bibliographic support outlined in the latest draft here.

In rough order of priority, I think you need:

1) to change the personal name model from first/middle/last to a more international-friendly given/family/prefix/suffix/other/sort-string.

For a specification that aspires to be an international standard, it’s simply unacceptable to be using a narrow, culturally-specific, personal name model.

The solution is simple: borrow from vCard, which is well-designed and widely-implemented, with the following name properties: given/family, honorific prefix and suffix, other, and a sort-string property (to account for different sorting conventions).

2) to rationalize your type list for bibliographic sources (see my post), and allow them to be extended

3) to provide rules for property extension so that developers aren’t forced into an all-or-nothing choice of what is now a very limited model

4) ideally, you need to bring the bibliographic metadata representation in line with the metadata descriptions used for the OXML document per se, and with wider standards.

For example, you can use dc and dcterms for a lot of these properties, and in particular for critical relations (notably dcterms:isPartOf and dcterms:isVersionOf) that would make the model both more flexible and more robust. Bibliographic metadata is really not flat, and the current model imposes significant limitations that will have an impact on users and developers.
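To illustrate why the relations in point 4 matter, here is a small sketch of non-flat bibliographic data, where a chapter points at its book and the book at its series through dcterms:isPartOf. The record identifiers, titles, and the `containers` helper are made up for the example; only the dc and dcterms property names come from the feedback above.

```python
# Hypothetical records keyed by identifier; the identifiers and values are
# invented, only the dc/dcterms property names come from the text.
records = {
    "urn:example:chapter": {
        "dc:title": "A Chapter",
        "dcterms:isPartOf": "urn:example:book",
    },
    "urn:example:book": {
        "dc:title": "An Edited Book",
        "dcterms:isPartOf": "urn:example:series",
    },
    "urn:example:series": {"dc:title": "A Book Series"},
}


def containers(record_id):
    """Follow dcterms:isPartOf links upward and collect each container's title."""
    titles = []
    seen = {record_id}  # guards against accidental cycles in the data
    parent = records[record_id].get("dcterms:isPartOf")
    while parent in records and parent not in seen:
        seen.add(parent)
        titles.append(records[parent]["dc:title"])
        parent = records[parent].get("dcterms:isPartOf")
    return titles


print(containers("urn:example:chapter"))
```

A flat model would have to copy the book and series titles into every chapter record. With the relation, each level is described once and a formatter can walk up as far as a citation style requires.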

None of these changes are in any way onerous to implement in either the spec or in software, but they will significantly enhance the usefulness of the bibliographic support in OXML.

I would hope too that thinking in terms of more general metadata support can also improve other areas of the spec with respect to metadata (if, for example, you have similar problems with name models elsewhere).
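As a rough illustration of the name model proposed in point 1, here is a sketch of a name record with an explicit sort string. The `PersonalName` class and its fallback rule are assumptions for the example, and the sample sort string is illustrative only; the field list follows the given/family/prefix/suffix/other/sort-string properties named above.

```python
from dataclasses import dataclass


@dataclass
class PersonalName:
    # Field names follow the property list in the text; the class itself
    # is a hypothetical illustration, not part of any spec.
    family: str
    given: str = ""
    prefix: str = ""
    suffix: str = ""
    other: str = ""
    sort_string: str = ""

    def sort_key(self):
        # An explicit sort string wins; otherwise fall back to "family given".
        return (self.sort_string or f"{self.family} {self.given}").strip().casefold()


names = [
    PersonalName(family="van Gogh", given="Vincent", sort_string="Gogh, Vincent van"),
    PersonalName(family="Mao", given="Zedong"),
    PersonalName(family="King", given="Martin Luther", prefix="Dr.", suffix="Jr."),
]

for name in sorted(names, key=PersonalName.sort_key):
    print(name.family, "|", name.given)
```

A first/middle/last model has nowhere to record that the first entry should file under "Gogh", and it wrongly implies that the family name always comes last when displayed.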

XPath-ing an RDF Profile

Posted in Uncategorized on August 14th, 2006 by darcusb – 1 Comment

I’ve been working on some stuff for the OpenDocument metadata group, including an RDF profile amenable to XML processing, simply to show what might be possible. I was working on an XSLT to demonstrate how it could be processed using standard XML tools, and also how I might model the constraints in Schematron (having already figured it out in RELAX NG), so naturally had to figure out how to write generic xpaths for basic structures like resources, properties, and so forth.

Here’s what I came up with as a start …

All resources:

//*[* and not(preceding-sibling::*/text()) and not(parent::*/@rdf:about)]

Am not too fond of this one, but it works with my example documents.

All properties:

//*[not(*)]

That’s more like it!

Would need more work to come up with a robust, generic RDF profile validator using only XPath (e.g. Schematron), but it seems not too hard. In any case, it’s certainly easier than writing a generic XML metadata validator!

Extensibility?

Posted in Uncategorized on August 12th, 2006 by darcusb – Comments Off

What does it mean to have extensible XML support?

This is a question that came up somewhat obliquely in the latest OpenDocument Metadata SC conference call, where I was presenting my draft requirements for the bibliographic use case, one of which was the need for extensibility. XML, after all, is an acronym for eXtensible Markup Language. Given my focus on metadata, I’ll restrict myself more to that realm.

It seems to me there are largely two views on this question. One perspective—I’ll call it the “document-based” view—says that extensibility is defined first through the simple ability to create new languages, and second within those languages to create strategic extension points.

Another view—I’ll call it the “module” view—sees metadata not fundamentally in terms of documents and complete schemas, but rather in terms of modules of descriptions that can be plugged together, mixed up, or otherwise interact, mostly independently.

This first view suggests to me an image of a book, complete with introduction and conclusion, index, and covers. It’s a more hermetic view of metadata.

The second view is, I think, the view of the web and hyperlinks, RDF, and more recently microformats. Why invent elaborate new schemas, this view says, when you can instead mix-and-match from a rich set of existing alternatives?

So when we at the Metadata SC talk about “extensibility” as a requirement, what do we mean?

I can only really speak for myself, but to me—a partisan of the second view—extensibility has to mean both that one can add custom XML markup and that the markup conforms to some rules such that ad hoc mixing and interaction is possible.

Simply allowing anything-goes addition of arbitrary content achieves little that is useful. While there may well be use cases for this sort of thing—Microsoft’s custom schema functionality surely must be valuable in some contexts—it seems to me it would be counterproductive to not insist on some minimal expectations of interoperability across a document format’s metadata format.

This is not to say that all conforming applications must fully understand extension structures, but it is to insist on the need for at least minimal legibility (for example, the ability to display any foreign content).
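Here is one way “minimal legibility” might look in practice: an application that understands only some namespaces still displays every property it finds, labeling the foreign ones instead of dropping them. This is a sketch under assumptions: the `display` function, the `meta` wrapper element, and the `ex:` vocabulary are hypothetical, and real extension content would of course be richer than flat name/value pairs.

```python
import xml.etree.ElementTree as ET

# Namespaces this hypothetical application understands natively.
KNOWN = {"http://purl.org/dc/elements/1.1/"}


def display(meta):
    """Render every child property, with a generic label for foreign ones."""
    lines = []
    for prop in meta:
        # ElementTree stores qualified names as "{namespace}local".
        if prop.tag.startswith("{"):
            ns, _, local = prop.tag[1:].partition("}")
        else:
            ns, local = "", prop.tag
        label = local if ns in KNOWN else f"{local} (foreign: {ns})"
        lines.append(f"{label}: {(prop.text or '').strip()}")
    return lines


doc = ET.fromstring(
    '<meta xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ex="http://ex.net/terms#">'
    "<dc:title>Title</dc:title><ex:status>draft</ex:status></meta>"
)
print("\n".join(display(doc)))
```

This only works because the extension follows a predictable property/value shape. With anything-goes markup, the application would have no rule for turning unknown content into something a user can read.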

RELAX NG, XSD, Schematron

Posted in Uncategorized on August 3rd, 2006 by darcusb – Comments Off

Anyone writing a new XML language in 2006 is faced with a choice of schema languages. Despite all the marketing and engineering dollars thrown at XML Schema, it is a brain-dead specification; horribly complex where it doesn’t need to be, and incredibly dumb elsewhere. There are all sorts of practical XML constraints that simply cannot be modeled in XSD.

Want to condition the validation of child elements on an attribute? Sorry, you can’t do that.

Want to give users a choice between an empty element with attributes, or text content without attributes? Nope.

Want to define a content model where order is unimportant? Sorry, you can’t do that either.

Thankfully, there is a better alternative in RELAX NG. Here’s an example (using the compact non-xml syntax) from my Citation Style Language (CSL) schema, where I condition validation on a root class attribute:

  CitationStyle =
    element cs:style {
      AuthorDateStyle
      | NumberStyle
      | LabelStyle
      | NoteStyle
      | AnnotatedStyle
      | CustomStyle
    }

The AuthorDateStyle pattern is then defined like:

  AuthorDateStyle =
    attribute class { "author-date" },
    Info,
    Terms?,
    Defaults,
    AuthorDateCitation,
    AuthorDateBibliography

So the AuthorDateStyle class requires citation and bibliography elements, the sort element child of bibliography must be set to “author-date”, and so forth. The schema reflects the expectations tool developers should bring to the table in designing scripts, or GUIs, or whatever.

But what happens if you need to provide your RNG schema for validation in XSD-oriented workflows? Here’s my own conclusion:

Define the schema in such a way that it is easy to create a customization that overrides the more complex restrictions; a simplified schema that Trang can automatically convert to valid XSD. It’s as simple as:

include "csl-alt.rnc" {
  cs-citationstyle =
    element cs:style {
      attribute class { cs-classes },
      Info,
      Terms?,
      Defaults,
      Citation?,
      Bibliography?
    }
}
cs-classes = "author-date" | "number" | "label" | "annotated" | "note"

… where all of the above patterns are simple ones without content restrictions that will make XSD choke.

Trang will then happily create a valid XSD file from this simplified schema.

However, you end up with a much looser schema, so now what? It’s hardly much use to be creating instances against such a loose schema, where they may be invalid against the normative spec and schema.

Answer: create some separate Schematron rules to model the constraints that XSD cannot. If you want to write it within your RNG customization schema (which can then be extracted using Trang + XSLT), then just do stuff like:

    s:rule [
      context = "/cs:style[@class='author-date']"
      s:assert [
        test = "cs:bibliography/cs:sort/@algorithm='author-date'"
        "Must use author-date sorting for the author-date class."
      ]
      s:assert [
        test = "name(cs:citation/cs:layout/cs:item/*[1]) = 'author'"
        "The citation item layout must include an author element first."
      ]
    ]
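For readers who don’t know Schematron, the two assertions above amount to roughly the following checks, sketched here in Python with the standard library. The namespace URI bound to `cs:`, the sample instance, and the `year` element are assumptions for the example; the paths and messages are taken from the rule. Note that the second check compares local names only, which is slightly looser than the `name()` test in the rule.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the cs: prefix; the post does not state it.
NS = {"cs": "http://purl.org/net/xbiblio/csl"}


def check_author_date(root):
    """Return failure messages for the two author-date assertions."""
    errors = []
    if root.tag != f"{{{NS['cs']}}}style" or root.get("class") != "author-date":
        return errors  # the rule's context does not match, so nothing fires
    sort = root.find("cs:bibliography/cs:sort", NS)
    if sort is None or sort.get("algorithm") != "author-date":
        errors.append("Must use author-date sorting for the author-date class.")
    item = root.find("cs:citation/cs:layout/cs:item", NS)
    first = item[0] if item is not None and len(item) else None
    # Compare the local name only, ignoring the namespace part of the tag.
    if first is None or first.tag.rsplit("}", 1)[-1] != "author":
        errors.append("The citation item layout must include an author element first.")
    return errors


# Hypothetical instance whose structure is inferred from the paths in the rule.
GOOD = """<style xmlns="http://purl.org/net/xbiblio/csl" class="author-date">
  <citation><layout><item><author/><year/></item></layout></citation>
  <bibliography><sort algorithm="author-date"/></bibliography>
</style>"""

BAD = GOOD.replace('algorithm="author-date"', 'algorithm="cited"').replace(
    "<author/><year/>", "<year/><author/>"
)

print(check_author_date(ET.fromstring(GOOD)))
print(check_author_date(ET.fromstring(BAD)))
```

The first instance yields no messages and the second yields both. This is exactly the kind of attribute-conditioned constraint that the loosened, XSD-friendly schema can no longer enforce on its own.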

Finally, write a little shell script to run both validations.
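A sketch of such a wrapper follows, in Python rather than shell so that it matches the other examples here. The `run_validations` helper is hypothetical, and the two commands are placeholders that simply exit with a fixed status; in practice you would substitute your XSD validator and your Schematron validator of choice.

```python
import subprocess
import sys


def run_validations(commands):
    """Run every command; succeed only if all of them exit with status 0."""
    all_ok = True
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            all_ok = False
            # Keep going so that one run reports failures from both steps.
            print(f"FAILED: {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return all_ok


# Placeholder commands so the sketch runs anywhere. Replace them with the
# real validator invocations for the XSD and the Schematron rules.
ok_step = [sys.executable, "-c", "raise SystemExit(0)"]
failing_step = [sys.executable, "-c", "raise SystemExit(1)"]

print(run_validations([ok_step, ok_step]))
print(run_validations([ok_step, failing_step]))
```

Running both steps even after the first fails is a deliberate choice: an instance can pass the loose XSD and still break a Schematron rule, and it is more useful to see both reports at once.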

Not nearly as elegant as the pure-RNG approach (it certainly does little for any real-time validating IDEs I know of), but it can ensure that the instances match the expectations modeled in your RNG schema. And learning a little Schematron is probably good anyway, because it in turn can express things that RELAX NG cannot.

Am personally hoping not to have to do this with CSL though; it’s enough for me to worry about one schema.

CSL Progress

Posted in General on July 29th, 2006 by darcusb – Comments Off

I just stumbled on this analysis of citeproc. Alas, it requires that I subscribe to Passport to leave a comment, so I thought I’d instead post a quick update on progress related to citeproc and CSL, since the public documentation is rather dated.

My focus has always been on CSL, with the XSLT 2.0 implementation as a solid proof-of-concept. To my mind the promise of CSL is really language- and document-format-independent citation styles.

I have been struggling my way to the 1.0 finish line with the schema, trying to finish with some tricky features, and to wherever possible simplify and rationalize the logic so it is easy for styles authors and developers to work with.

To that end, I’ve gotten a lot of help from one of the Firefox Scholar developers, who is implementing support for CSL in Javascript. Here is a screenshot of a development version:

When they release the extension sometime in the next few months, expect it to support CSL out of the box for citation style configuration.

Alongside that, Johan Kool jumped in and decided to work out a Python version. With the three of us working on design details of CSL, each using a different language, we’ve managed to make a lot of progress in resolving some of the more difficult problems. We are targeting a pre-1.0 test release sometime in the next couple of weeks, and then a final 1.0 freeze in early September.

At that point, hopefully things get more interesting, and I can sit back and watch how others make use of CSL.

What’s new?

  1. the “info” metadata element uses the same content modeling as that in Atom
  2. the whole thing is designed on a consistent inheritance model that makes it simple to do the common stuff, and possible to do more complicated customization
  3. in part as a result, the data field and template models are simpler; constructing GUI editors ought to be easier
  4. I finally figured out how to support complex internationalization options without making things more complicated for those who don’t need them
  5. at the request of Matthias Steffens from the RefBase project, we figured out a fairly elegant localization approach

More soon.

New FUD Offensive

Posted in Uncategorized on July 27th, 2006 by darcusb – Comments Off

It seems Microsoft is gearing up for yet another new anti-ODF FUD offensive, and Brian Jones is leading it. I find responding to every little detail tiresome, so will just address this point about how the standards are developed:

I think the key here is for everyone to just be clear on the goals. The ODF format is based on Sun’s StarOffice, and Open XML was based on the Microsoft Office formats. Both have the goals of being open, both have been submitted to standards bodies, and both have a commitment from the donating companies (Sun and Microsoft) that there will be no licensing restrictions and anyone is allowed to freely use the formats.

This is classic FUD: factually true enough, but false by way of omission.

If you want to understand the goals of ODF, just read the TC Charter. A few of its goals are absent from Microsoft’s, notably friendliness to processing using XML tools and the reuse of existing standards. I happen to think those matter to developers and ultimately users.

FWIW, I am on the ODF TC. But I have also given MS plenty of constructive comments on the way they are implementing citation support in OXML, because my interest is in promoting better solutions in general. I’d rather have two excellent open XML formats, than two weak ones.

Perhaps this will be a good test of how well the two standards processes work? My guess is none of my comments will have any effect on OXML.

