Posts Tagged ‘XML’

Linked Library Data

Posted in General on September 26th, 2007 by darcusb

Ed has two recent posts that ought to get one thinking about the possibilities of libraries (and in particular big data providers like OCLC and the Library of Congress) getting on board the semantic web train. The first describes a more high-level goal of the open data movement, complete with nice diagram. The second is a much more grounded example, which he and I put together, of the kind of practical thing that can make it happen. Allow me to illustrate from my command-line:

$  xsltproc \
http://inkdroid.org/data/identity-foaf.xsl \
http://orlabs.oclc.org/Identities/key/lccn-no99-10609 | xmllint --format -

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:srw="http://www.loc.gov/zing/srw/" xmlns:foaf="http://xmlns.com/foaf/0.1/">

  <rdf:Description rdf:about="http://orlabs.oclc.org/Identities/key/lccn-no99-10609">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:name>Berners-Lee, Tim</foaf:name>
    <foaf:made rdf:resource="http://worldcat.org/oclc/041238513"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/048753874"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/040278766"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/044933478"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/045065386"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/044281610"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/075964549"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/044721973"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/036040597"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/040938943"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/051662536"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/122918124"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/034829358"/>
  </rdf:Description>

</rdf:RDF>

It took all of about 30 minutes to do this. Now imagine if each of those target URIs also served up (either directly, or via GRDDL) RDF descriptions of those resources …
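To give a sense of how little a consumer needs in order to use output like this, here is a minimal Python sketch that pulls the person's name and the linked work URIs out of the RDF/XML. It is an illustration only: it assumes the flat rdf:Description shape shown above, it is not a general RDF parser, and the function name works_by_person and the abbreviated SAMPLE are made up for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Abbreviated copy of the output above (only two of the foaf:made links).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://orlabs.oclc.org/Identities/key/lccn-no99-10609">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:name>Berners-Lee, Tim</foaf:name>
    <foaf:made rdf:resource="http://worldcat.org/oclc/041238513"/>
    <foaf:made rdf:resource="http://worldcat.org/oclc/048753874"/>
  </rdf:Description>
</rdf:RDF>"""


def works_by_person(rdf_xml):
    """Return {person URI: (name, [URIs of works made])}.

    Only handles top-level rdf:Description elements with foaf:name and
    foaf:made children, which is the shape of the sample output.
    """
    root = ET.fromstring(rdf_xml)
    result = {}
    for desc in root.findall(f"{{{RDF}}}Description"):
        about = desc.get(f"{{{RDF}}}about")
        name = desc.findtext(f"{{{FOAF}}}name")
        made = [m.get(f"{{{RDF}}}resource")
                for m in desc.findall(f"{{{FOAF}}}made")]
        result[about] = (name, made)
    return result


if __name__ == "__main__":
    for person, (name, works) in works_by_person(SAMPLE).items():
        print(name, person)
        for uri in works:
            print("  made:", uri)
```

Each of the collected URIs is exactly the kind of link that could be followed for a further RDF description, if the target served one.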

Citations and Fields

Posted in Uncategorized on May 26th, 2007 by darcusb

I’ve been having an interesting discussion with people involved in implementing citation processing in Zotero. This is the functionality that allows you to add a citation to your Word or OOo Writer document, and have it and the bibliography automatically generated.

They’ve hit a rather large conceptual and practical stumbling block: how to implement note-based citations. If a user adds a citation to the document and it is automatically rendered as a footnote, is that object then a citation in a footnote, or a citation that is simply rendered as a footnote?

Use Cases

Allow me to explain with some use cases:

Basic Case

A user starts a new research paper. They select a footnote-based citation style. They add citations to the document, and each of them is automatically rendered as a footnote.

They then realize they need to use a different citation style, and choose instead an APA in-text author-date style. The footnoted citations are then automatically moved into the text in the proper form.

Complex Case 1

A user wants to add a footnote to the document and include one or more citation references in it. They add the footnote, and then add both their commentary and the related citations. If they switch to a non-note-based citation style, this footnote remains a footnote; only the citation rendering changes.

Complex Case 2

A user wishes to add commentary about an automatically footnoted citation within the note itself (as opposed to in the body text). They click in the body of the footnote and begin typing. If they switch to a non-note-based citation style, this footnote also remains a footnote.

Discussion

Citations can occur either in the main body text, or in notes. Whatever the citation style, citations in notes (and their rendering) differ from body-text citations, because they occur in the context of note-based commentary. Their position in the note is thus not an artifact of the citation style, but rather fundamental to the content. Both the content of that note and its citations will remain in the note regardless.

There is no disagreement about the basic case. We all agree citations should be automatically footnoted in note-based citation styles. This is not some theoretical problem. Some fields use both note-based and in-text author-date styles, and absent automation, users wishing to switch from one to the other would have to manually move every single citation in and out of their notes, a tedious process. We all agree it’s a major shortcoming of existing applications (like EndNote) that they do not manage this issue for their users.

Where we diverge is on implementation details highlighted in the complex cases.

Complex Case 1 illustrates the clear distinction between the two: it is a citation within a footnote, rather than a style-dependent footnoted citation.

Complex Case 2, however, demonstrates a likely case where the user in essence might want to convert a footnoted citation into the first form.

So two different issues of concern to me:

First, what should the user experience be here when a user would like to add commentary to citations?

Forget about footnotes. Consider short comments in in-text citations. I want to write (Doe, 1999; see also Smith, 2000, chapter 2). Can I do this? If so, how? And if I do, how do I select the citation source?

Note: my questions above do not necessarily presume any answers. I am asking, though, because users sometimes do add such comments to in-text citations.

Second, how should this be encoded in document formats (specifically ODF and OOXML) such that users can be confident of some acceptable level of interoperability in citations across different applications?

The debate we’ve been having touches on both dimensions of the question, but a bit more on the latter. In short, should a citation field in ODF or OOXML be allowed to contain a footnote or endnote, or must the citation always be wrapped in the note?

Allow me to illustrate using the new text:meta-field from the ODF metadata work. Let’s imagine a multi-reference citation with an author-date style. It might be done like so:

<text:meta-field xml:id="citation-1">
  (<text:meta-field xml:id="citation-1-r1">
    Doe, 1999
  </text:meta-field>;
  <text:meta-field xml:id="citation-1-r2">
    Smith, 2000
  </text:meta-field>)
</text:meta-field>

So we have a nested field. These fields are then hooked up (via a binding that uses the xml:id) to some RDF/XML in the file package.

To a user, this would display like:

(Doe, 1999; Smith, 2000)

They could individually select the references, which would be read-only.

So now: what happens if the user changes to a note-based style?

My argument is that because the footnote/endnote rendering is only an artifact of the processing, and does not reflect a user’s explicit choice, the XML encoding should reflect this by including the footnote within the outer field; something like:

<text:meta-field xml:id="citation-1">
  <text:note>
    <text:meta-field xml:id="citation-1-r1">
      Doe, 1999, Some Title, New York:ABC Books.
    </text:meta-field>;
    <text:meta-field xml:id="citation-1-r2">
      Smith, 2000, Some Other Title, London:XYZ Books.
    </text:meta-field>
  </text:note>
</text:meta-field>

The only time a citation should be contained within a note is when a user explicitly chooses to do so.
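The distinction can be checked mechanically, which is part of why I think it is worth encoding. Below is a rough Python sketch that classifies each outermost citation field by how it relates to a note. It assumes the simplified nesting used in the examples above (a real ODF text:note carries more structure), and the function name classify_citations and the three labels are hypothetical, not part of any ODF proposal.

```python
import xml.etree.ElementTree as ET

# The ODF text namespace; the sample markup is simplified as in the post.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
META_FIELD = f"{{{TEXT_NS}}}meta-field"
NOTE = f"{{{TEXT_NS}}}note"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Mixes all three situations in one paragraph purely for illustration.
SAMPLE = f"""<text:p xmlns:text="{TEXT_NS}">
  <text:meta-field xml:id="citation-1">
    <text:note>
      <text:meta-field xml:id="citation-1-r1">Doe, 1999, Some Title.</text:meta-field>
    </text:note>
  </text:meta-field>
  <text:note>
    See also <text:meta-field xml:id="citation-2">
      <text:meta-field xml:id="citation-2-r1">Smith, 2000</text:meta-field>
    </text:meta-field> for commentary.
  </text:note>
  <text:meta-field xml:id="citation-3">(Doe, 1999)</text:meta-field>
</text:p>"""


def classify_citations(xml_text):
    """Map each outermost citation field's xml:id to one of three labels.

    'style-generated note': the field directly wraps a note (rendering artifact)
    'inside user note'    : the field sits within a note the user created
    'in-text'             : no note is involved
    """
    root = ET.fromstring(xml_text)
    found = {}

    def walk(elem, in_note, in_field):
        if elem.tag == META_FIELD and not in_field:
            if in_note:
                kind = "inside user note"
            elif elem.find(NOTE) is not None:  # direct child only
                kind = "style-generated note"
            else:
                kind = "in-text"
            found[elem.get(XML_ID)] = kind
            in_field = True  # nested reference fields are not classified
        if elem.tag == NOTE:
            in_note = True
        for child in elem:
            walk(child, in_note, in_field)

    walk(root, False, False)
    return found


if __name__ == "__main__":
    for field_id, kind in classify_citations(SAMPLE).items():
        print(field_id, "->", kind)
```

An application switching to an author-date style would then unwrap only the fields labelled as style-generated, and leave user-created notes alone.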

So the questions are, I suppose:

  1. Does this make sense from a user-experience and document-encoding perspective?
  2. Can this be implemented such that we can, at least at some point in the not-distant future, have interoperability across different editing and bibliographic applications?

To be more concrete, when MS adds support for note-based citations, how will they encode them in OOXML? When OOo developers add support for the new metadata field and citations, how will they do it?

[update: fixed some minor typos]

A Framework for the (Bibliographic) Future

Posted in Uncategorized on March 13th, 2007 by darcusb

William Denton has a link to a draft of a Framework for a Bibliographic Future. I don’t have time to read it, much less comment on it, in depth, but just wanted to pick out a few places that raise some concerns.

First:

Metadata schemas (sometimes called ‘element sets,’ ‘metadata formats’ or ‘data dictionaries’) define the actual properties that will carry values in the data set, as well as the relationships between those properties. Data elements can be defined at any relevant level of granularity. They can have hierarchical relationships between them or non-hierarchical relationships.

The problem I read into this is that this is bound to an XML view of the world. The language of hierarchy is just that kind of view, and it excludes the more flexible relational and graph-based views of relational databases and RDF. So in any case, I suggest purging the draft of any suggestion that a tree-based XML model ought to be in any way privileged.

The second follows just after:

FRBR defines data elements in its attributes, but they must be restructured in a way that allows the development of different levels of granularity and that promotes extensibility of the schema, both over time and across communities.

… and:

Crucial to the proper development of a metadata schema is a clear notion of requirements for technical expression of the attributes, and a plan for maintenance and growth. We have learned much in the library community about the importance of community consensus and how to maintain important standards over time.

So the group wants a clean but extensible model that can be serialized in different ways, and integrated with backend systems I presume. They also claim a need for “community consensus,” which seems to suggest a requirement for centralized development and management.

While the first makes a lot of sense, the second seems more a consequence of limited technologies than a formal requirement. In fact, this is a major problem with MARC, MODS, MARCXML, MADS, etc. Wouldn’t it, for example, make much more sense to have a framework that could evolve in a distributed way; where different organizations and communities could extend it as needed without need for wider consensus?

I’m going to repeat my mantra here: look at RDF. It provides the common and extensible model you want here, in ways that are relatively more friendly than generic XML to the relational databases so widely in use. It also can map fairly cleanly to object oriented programming. Finally, RDF also notably does not require the kind of centralized development and management suggested above.
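To make the point about distributed extension concrete, here is a toy Python sketch that models a graph as a plain set of (subject, predicate, object) triples. It is not an RDF library, and the example.org identifiers and the citedByCount property are hypothetical; only the Dublin Core terms URIs are real.

```python
# A graph is a set of (subject, predicate, object) triples.
BOOK = "http://example.org/book/1"  # hypothetical identifier

# One community describes the book with Dublin Core terms.
library_graph = {
    (BOOK, "http://purl.org/dc/terms/title", "Some Title"),
    (BOOK, "http://purl.org/dc/terms/creator", "Doe, Jane"),
}

# Another community adds a property from its own (hypothetical) namespace,
# without changing or renegotiating the first community's schema.
citation_graph = {
    (BOOK, "http://example.org/cite#citedByCount", "42"),
}


def merge(*graphs):
    """Merging graphs is just set union; no shared schema change is needed."""
    return set().union(*graphs)


def describe(graph, subject):
    """Return the set of (predicate, object) pairs known for a subject."""
    return {(p, o) for s, p, o in graph if s == subject}


if __name__ == "__main__":
    merged = merge(library_graph, citation_graph)
    for predicate, obj in sorted(describe(merged, BOOK)):
        print(predicate, "=", obj)
```

A consumer that only understands the Dublin Core properties can simply ignore the extra triple, which is the sense in which no wider consensus is required.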

ODF 2.0 and OOXML

Posted in Uncategorized on March 12th, 2007 by darcusb

In comments to a post on an interesting interview with former Massachusetts IT official Louis Gutierrez, Kurt Cagle offers some sensible advice to Microsoft:

If Microsoft really wants to put together a superior product, then it should participate in the ODF standard for a 2.0 version that moves a little closer to its own perceived needs, recognize that their monopoly isn’t going to last much longer anyway if they are at a stage where they are losing customers because of their own inability to work with them on something as basic as interoperability, and compete on quality. I don’t think anyone - the ODF TC, the ISO standards committees, the customers - would seriously have any problem with that, and it would show that Microsoft recognizes that the world has moved away from where it was in 1997.

I see this idea, to bring ODF and OOXML closer in line in some future ODF spec, floated more and more, and it seems quite a sensible way forward. I would certainly welcome it, and while I cannot speak for the rest of the ODF TC, I wouldn’t be surprised if they felt likewise.

Adobe Mars

Posted in Uncategorized on December 9th, 2006 by darcusb

Eliot Kimber with a nice analysis of Adobe’s new Mars effort:

MARS is an XML-based format that is intended as a functional replacement for PDF…. After seeing Adobe’s presentation and talking to the guys from Adobe it’s clear that what they’ve done is a sincere and well-thought-out attempt to Do The Right Thing rather than a cynical recasting of proprietary stuff into markup so it’s “open.” MARS tries to use standards as much as it can and it seems to do so to a remarkable level of completeness. It uses SVG for representing each page, supports the usual standards for media objects (bitmaps, videos, etc.). Uses Zip for packaging, and so on.

I agree.

There are, however, two suggestions I have for Adobe. First, they should seriously consider using ODF to do the packaging. In fact, there’s already evidence they’ve thought of this. The last I looked, there was an ODF namespace in their manifest, even if it didn’t seem like it would validate against the ODF manifest schema per se. I’m sure the OASIS ODF TC would be happy to discuss any changes they might need.

Likewise, this ties us back to the metadata question. It’s time for Adobe to seriously reevaluate XMP and move it from being an essentially proprietary subset of the RDF spec circa 2000, to reflecting the more open, more technically refined, and richer RDF of today. Think XMP NG, which is largely RDF/XML proper, plus conventions for embedding in (particularly binary) files.

Now that could be a really interesting combination!

Mozilla 2.0

Posted in Uncategorized on October 16th, 2006 by darcusb

An update on plans for Mozilla 2.0, including this:

For instance, we can get rid of RDF, which seems to be the main source of “Mozilla ugliness”

I agree with the point that it ought to be possible to do without RDF (or XML) for configuration files. OTOH, I can’t help but think there’s an opportunity to improve the RDF support in Mozilla on top of the new unified storage system built on SQLite. Indeed, others have pointed this out. Mozilla’s RDF support is ugly, then, but not because RDF per se can’t be done much more cleanly, with real benefits to users and developers of more data-oriented extensions.

For example, I’m not really sure there’s any reason that Zotero had to be built on raw SQLite, and couldn’t have instead benefited from an RDF abstraction on top.
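As a rough idea of what such an abstraction could look like, here is a small Python sketch of a triple table on top of SQLite with pattern matching. It uses Python's standard sqlite3 module rather than anything in Mozilla, and the schema and function names are invented for illustration; this is not how Zotero or Mozilla actually store their data.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one table of subject/predicate/object rows."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        "s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL, "
        "UNIQUE (s, p, o))"
    )
    return db


def add(db, s, p, o):
    """Add one triple; adding the same triple twice is a no-op."""
    db.execute("INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", (s, p, o))


def match(db, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("s", s), ("p", p), ("o", o)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    query = "SELECT s, p, o FROM triples" + where + " ORDER BY s, p, o"
    return db.execute(query, params).fetchall()


if __name__ == "__main__":
    store = open_store()
    add(store, "item:1", "dc:title", "Some Title")
    add(store, "item:1", "dc:creator", "Doe, Jane")
    add(store, "item:2", "dc:title", "Some Other Title")
    print(match(store, p="dc:title"))
```

An extension written against add and match never needs to know about table layouts, so new kinds of data do not require schema migrations.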

Metadata SC Use Cases and Requirements Approved

Posted in Uncategorized on October 13th, 2006 by darcusb

ODF TC chair Michael Brauer has a quick summary of the approval of the ODF metadata use cases and requirements document I edited that will frame the proposal we will deliver sometime in the next few months. As he writes:

It will be the basis for the future work of the metadata subcommittee, and therefore provides an outlook in which direction OpenDocument moves regarding metadata. And because OpenDocument is OpenOffice.org’s native and default file format, I’m sure it also provides an outlook in which direction OpenOffice.org may move.

Open XML Final Draft

Posted in Uncategorized on October 10th, 2006 by darcusb

Microsoft has released the final draft of their Open XML file format specification. I submitted a detailed list of comments to ECMA, and they did respond to them. However, it’s worth noting that they made no substantive changes at all. The most serious problem is that their name model is still U.S./Western-centric. They tried to get around the problem by adding a small editorial comment that a first name is equivalent to a given name, and a last name to a family name, but I hardly consider this an adequate response. Their continued use of “middle” name is even more annoying, given that it doesn’t even work for many Western names; consider “J. Edgar Hoover.”
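For contrast, here is a minimal Python sketch of a name model built on family and given parts plus a display-order flag, instead of first, middle, and last. The class and field names are hypothetical and are not taken from Open XML, ODF, or any other specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalName:
    """A name as family and given parts; no 'first', 'middle', or 'last'."""
    family: str
    given: str = ""
    family_first: bool = False  # display order varies by naming convention

    def display(self):
        """Render the name in its natural display order."""
        if self.family_first:
            parts = [self.family, self.given]
        else:
            parts = [self.given, self.family]
        return " ".join(part for part in parts if part)

    def sort_key(self):
        """Render the name family-first for sorting in a bibliography."""
        return ", ".join(part for part in (self.family, self.given) if part)


if __name__ == "__main__":
    hoover = PersonalName(family="Hoover", given="J. Edgar")
    mao = PersonalName(family="Mao", given="Zedong", family_first=True)
    print(hoover.display(), "|", hoover.sort_key())
    print(mao.display(), "|", mao.sort_key())
```

The given part holds “J. Edgar” whole, so there is no need to decide which piece counts as the “middle” name.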

I give Microsoft credit for opening up, but make no mistake: Open XML is really not that open. It is designed by and for Microsoft, and it’s clear that all decisions on the spec were driven by their product teams. The team implementing their new bibliographic support couldn’t be bothered to make what in effect were trivial changes, and so the ECMA TC45 couldn’t be bothered to fix the spec.

Contrast this with the OpenDocument process, where the people now driving the future of the specification are in many cases unaffiliated with any of the usual players: IBM, Sun, KOffice. Moreover, all our comments and responses to them are publicly available.

XSLT 1.0 + EXSLT

Posted in Uncategorized on October 9th, 2006 by darcusb

If I (or, ahem, preferably someone else) were going to port my XSLT 2.0 version of CiteProc to 1.0, this is a hint of how to do it. So, like the current version, use custom functions to do the heavy lifting, and keep the templates as clean as possible. It’s really not possible to do without support for EXSLT though.

Barrier to Entry

Posted in Uncategorized on October 8th, 2006 by darcusb

With all the talk of the open standards and free software revolution, the following assessment of Zotero may seem anachronistic:

First of all, Zotero only works with Firefox 2. Yes, not even with regular Firefox as most people use it right now, but only version 2. At the moment, only a release candidate is out, no official version 2 yet. Tilburg University (my employer) has a policy to only support MS programs, so only Internet Explorer is supported. Which means I cannot recommend Firefox to people with only limited computer expertise.

The second aspect of Zotero that needs work, is its integration with Word. The one reason people keep using EndNote, even though they don’t like it that much, is because it is easy to create citations and bibliographies in a text. The integration with Word (again, a Microsoft product which is supported on campus) makes it easy to write an article, insert citations in the proper places, and have EndNote create a good bibliography, using the correct style for a certain publication. Want to submit the article to a different journal? EndNote will change the style of the bibliography for you, in a few seconds.

But it’s actually not. Despite the fact that Firefox is free and better than Internet Explorer, and despite the fact that Zotero is also free and better, this sort of argument is not uncommon at all. It shows just how much weight the MS monopoly places on the market and on innovation. Now guess how hard it is to convince anyone to use OpenOffice.

OTOH, there’s something so incredibly small-minded about all this hand-wringing. The point about integration with Word is not exactly wrong, but a) it will happen (it’s on the roadmap), and b) there’s something so defeatist about the tone. Zotero is free software; if you’re worried about integration with Word — or any other application — then stop sitting around and do something about it! Write a letter or schedule a meeting with the campus IT people who set the brain-dead policies that they will only support Microsoft solutions. Talk to the people who make budgeting decisions and ask them to consider finding a way to direct money away from proprietary solutions into open ones. Talk to your campus computer science department and tell them you want their students to actually help create better solutions. Your institution will benefit in all kinds of direct and indirect ways; you’ll get better and cheaper software, and your students might learn valuable skills that will help them in the future.

If all of us simply worried about what it would take to change the world, nothing would ever get done. Zotero, and CSL, and OpenDocument, and OpenOffice, and RDF are going to do just that in this little corner of the scholarly world.


Creative Commons License