Posts Tagged ‘RDF’

The Babel of Citations

Posted in Technology on March 1st, 2009 by darcusb – 4 Comments

I’m prompted to consolidate thoughts I’ve been mulling for a while by a recent post to the OpenDocument comment list from Alex Brown. In it, Alex correctly observes that “[t]he modelling of bibliographic citations in ODF is totally inadequate for real-world content,” and suggests instead that ODF must either remove the existing inadequate model, or replace it with a model which is fit-for-purpose; preferably one based on existing actual or de-facto standard.

I support Alex’s basic analysis, though I reach a somewhat different conclusion that keeps with the spirit of his post. I’d suggest removing the current support AND also adding more substantial support via the new RDF-based metadata support coming in ODF 1.2. An RDF vocabulary like bibo heavily reuses existing standards like Dublin Core and FOAF, and only adds those domain-specific types and properties they are missing. It also means native RDF extensibility comes included. If a developer needs to encode some data not appropriate to include in bibo or DC or FOAF as a whole, they can simply do so without breaking things.

Here’s the thing, though:

Simply having standard support in a document format is not enough; you need to entice developers to actually use it. And the evidence from the OOXML world is that this is not happening. I am, for example, aware of three different third-party bibliographic applications that can work with Word 2007/2008: Zotero, Mendeley, and Endnote. None of them use the standard OOXML support for citations and bibliographies, and all of them use their own custom fields.

The upshot: an absolutely unacceptable Tower of Babel. Users cannot collaborate on their documents because the citation fields are specific to different applications.

So, yes, some guidelines and standards for OOXML and ODF would be valuable. But this is not close to enough. Consider this very practical use case that shows both where this market really needs to be, and how far away it still is:

Jane starts a document using OpenOffice and Zotero, adding citations as she goes. She sends the document to her colleagues who use Word and Mendeley and/or Endnote, and still another who prefers Microsoft’s built-in support. They also add citations, and then send the document back to Jane. Jane can then add still more citations, change the citation style, and everything updates correctly for the final draft.

This use case is impossible to realize right now. Even projects like Zotero that have been built on the principle of openness, and which support both Word and OpenOffice, cannot support different users collaborating on the same document.

So what do we need?

We need word-processor developers to build the APIs that make it easy for third-party developers to use the standard fields and metadata support.

I’m looking at you, Microsoft, where you have a citation and bibliographic API that is not really serious about opening up opportunities for third-party applications.

I’m looking at Apple, whose Pages application has a closed API for use only by Endnote.

I’m looking at OpenOffice, who I hope is successful in contributing towards some of this with the forthcoming RDF support.

I also think third-party projects like Zotero and Mendeley and Thomson Reuters need to raise the priority of interoperability. Doing so effectively can also be in their own self-interest, as it could mean fewer development resources poured into maintaining separate processing code bases.

So in short, let’s add richer support to ODF, but let’s also see different developers contribute towards realizing the use case I outline above.

OpenVocab and Bibo 1.2

Posted in Technology on December 3rd, 2008 by darcusb – Comments Off

Two interesting pieces of news on the RDF front …

First, Ian Davis has released a cool DIY RDF property creation web app. You have to see OpenVocab to understand it.

Second, Fred Giasson has pushed out a new release of bibo.

lcsh.info and sparql

Posted in Technology on July 13th, 2008 by darcusb – Comments Off

Ed recently added a SPARQL endpoint for his lcsh.info site. A simple example query to return the concept URI and label for all labels that match a particular regular expression (in this case, those that start with “public”):

SELECT ?concept ?label WHERE {
  ?concept <http://www.w3.org/2004/02/skos/core#prefLabel> ?label .
  FILTER regex(?label, "^public", "i") 
}
LIMIT 10
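To send a query like this from a script, the SPARQL protocol lets you pass it as a `query` parameter on a GET request. Here is a minimal sketch of building that URL with the Python standard library. The endpoint address is a placeholder, since the real one is not given here, and nothing is actually fetched.

```python
from urllib.parse import urlencode

# Hypothetical endpoint URL; substitute the real address of the service.
ENDPOINT = "http://example.org/sparql"

# The query from the post, with the predicate written as a full IRI.
QUERY = """\
SELECT ?concept ?label WHERE {
  ?concept <http://www.w3.org/2004/02/skos/core#prefLabel> ?label .
  FILTER regex(?label, "^public", "i")
}
LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
```

The URL-encoding matters because the query contains characters such as `#`, `?` and newlines that would otherwise be misread as parts of the URL.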

I’m integrating some of these concepts into my own RDF and (forthcoming) personal site.

bibo 1.0

Posted in Teaching on June 5th, 2008 by darcusb – Comments Off

Yesterday, Fred announced the first formal (1.0) release of the Bibliographic Ontology. See there for details.

The primary change from previous drafts is in how we handle contributors. This was a difficult decision, but we decided to split the modeling of roles (editor vs. author vs. translator) from that of order. So, for example, we have a bibo:editor property that is a subproperty of dcterms:contributor, and we also have a bibo:editorList property to record the list proper.

We also added in some structures from related ontologies to handle events like broadcasts.

OpenOffice 3.0 Beta and Metadata

Posted in Technology on May 7th, 2008 by darcusb – 3 Comments

The OpenOffice project has announced a first public beta of version 3.0 of the suite.

The most interesting among the list of new features from my standpoint? Easily the powerful new metadata support that will accompany the move to ODF 1.2. I spent a pretty difficult year helping move this pretty ambitious new functionality through the ODF TC at OASIS, so it’s nice to see it making it not only into the spec, but also into ODF’s most high-profile implementation.

I don’t believe the new RDF API is in this beta version, but we ought to see it soon enough I imagine. For those that might be curious, the API will just be a wrapper for Redland.

Author Lists

Posted in Technology on April 6th, 2008 by darcusb – 5 Comments

As Fred and I are gearing up to finally release a formal first draft of the bibliographic ontology, one of the biggest decisions we need to make is how to represent different kinds of contributions. When you have a single book author, this is easy to do. But there are all kinds of complicated real-world examples that make this a difficult issue to resolve.

Let’s be concrete and look at an example from the journal Nature. We have here an article with 22 contributors. The list of contributors in turn has 12 notes attached to it, which for the most part indicate affiliation, but also group what seem to be primary authors. Finally, after the enumerated notes we have a note that indicates the corresponding author.

So the first question is, how does Nature represent this in a standard legacy format like RIS? Answer: they just have an ordered author list:

TY  - JOUR
AU  - Kleinman, Mark E.
AU  - Yamada, Kiyoshi
AU  - Takeda, Atsunobu
AU  - Chandrasekaran, Vasu
AU  - Nozaki, Miho
AU  - Baffi, Judit Z.
AU  - Albuquerque, Romulo J. C.
AU  - Yamasaki, Satoshi
AU  - Itaya, Masahiro
AU  - Pan, Yuzhen
AU  - Appukuttan, Binoy
AU  - Gibbs, Daniel
AU  - Yang, Zhenglin
AU  - Kariko, Katalin
AU  - Ambati, Balamurali K.
AU  - Wilgus, Traci A.
AU  - DiPietro, Luisa A.
AU  - Sakurai, Eiji
AU  - Zhang, Kang
AU  - Smith, Justine R.
AU  - Taylor, Ethan W.
AU  - Ambati, Jayakrishna

How do we do this in a more relational model, though: say, a relational database or RDF? Both of these are unordered models.

One option is to simply translate this directly to RDF:

<http://www.nature.com/nature/journal/v452/n7187/full/nature06765.html>
    a bibo:AcademicArticle ;
    dc:creator "Kleinman, Mark E." ;
    dc:creator "Yamada, Kiyoshi" ;
    dc:creator "Takeda, Atsunobu" ;
    dc:creator "Chandrasekaran, Vasu" ;
    dc:creator "Nozaki, Miho" ;
    dc:creator "Baffi, Judit Z." ;
    dc:creator "Albuquerque, Romulo J. C." ;
    dc:creator "Yamasaki, Satoshi" ;
    dc:creator "Itaya, Masahiro" ;
    dc:creator "Pan, Yuzhen" ;
    dc:creator "Appukuttan, Binoy" ;
    dc:creator "Gibbs, Daniel" ;
    dc:creator "Yang, Zhenglin" ;
    dc:creator "Kariko, Katalin" ;
    dc:creator "Ambati, Balamurali K." ;
    dc:creator "Wilgus, Traci A." ;
    dc:creator "DiPietro, Luisa A." ;
    dc:creator "Sakurai, Eiji" ;
    dc:creator "Zhang, Kang" ;
    dc:creator "Smith, Justine R." ;
    dc:creator "Taylor, Ethan W." ;
    dc:creator "Ambati, Jayakrishna" .

This is what Ingenta does in its RSS/RDF feeds. The problem here is that you lose order, and hence relative contribution. You also aren’t treating the authors as full objects, but just dumb strings. You can’t, for example, attach affiliation information to them.

Another option is an even simpler de-normalized form: a string with a delimited set of author names. In RDF, you’d basically join the creator strings into a single property.

This preserves order, but it doesn’t get you very far. From the data model perspective, the meaning of the data within that string is totally opaque. You can’t, for example, search based on author name without some programming gymnastics.
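To make the “gymnastics” concrete, here is a small sketch of the de-normalized form and what an author search against it looks like. The `"; "` delimiter is my assumption; the names themselves contain commas, so a comma would not work.

```python
# Hypothetical delimiter; a comma is ruled out because the names contain one.
DELIMITER = "; "

authors = ["Kleinman, Mark E.", "Yamada, Kiyoshi", "Takeda, Atsunobu"]

# De-normalized form: a single opaque string, with order preserved.
creator = DELIMITER.join(authors)


def has_author(creator_string: str, family_name: str) -> bool:
    """Split the string back apart and compare family names, ignoring case."""
    for name in creator_string.split(DELIMITER):
        family, _, _given = name.partition(",")
        if family.strip().lower() == family_name.lower():
            return True
    return False
```

Every consumer has to know the delimiter and the "Family, Given" convention to get anything out of the string, which is exactly the knowledge the data model itself does not carry.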

The more normalized form would represent the contributions explicitly. So, imagine a contributions table with foreign key references to both an “agents” or “contributors” table and to the “references” (or whatever) table, plus a foreign key reference to a “roles” table, and an integer column that tracks the “position” within the list. While more complex, this gives some additional advantages, such as being able to distinguish the first three on the list as primary authors, and the rest as secondary. In RDF, a fragment would be:

<http://www.nature.com/nature/journal/v452/n7187/full/nature06765.html>
    a bibo:AcademicArticle ;
    bibo:contribution [
        bibo:contributor [ foaf:name "Kleinman, Mark E." ] ;
        bibo:role bibo_roles:author ;
        bibo:position "1" 
       ] .
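The relational version of the same idea can be sketched with SQLite from the Python standard library. The table and column names are only illustrative (I use `refs` because `references` is an SQL keyword), and only the first three authors are loaded, deliberately out of order.

```python
import sqlite3

ARTICLE = "http://www.nature.com/nature/journal/v452/n7187/full/nature06765.html"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agents (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE refs   (id INTEGER PRIMARY KEY, uri   TEXT NOT NULL);
CREATE TABLE roles  (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE contributions (
    ref_id   INTEGER NOT NULL REFERENCES refs(id),
    agent_id INTEGER NOT NULL REFERENCES agents(id),
    role_id  INTEGER NOT NULL REFERENCES roles(id),
    position INTEGER NOT NULL  -- place within the contributor list
);
""")

conn.execute("INSERT INTO refs VALUES (1, ?)", (ARTICLE,))
conn.execute("INSERT INTO roles VALUES (1, 'author')")

# Rows are inserted out of order on purpose; 'position' carries the order.
names = ["Takeda, Atsunobu", "Kleinman, Mark E.", "Yamada, Kiyoshi"]
positions = [3, 1, 2]
for agent_id, (name, position) in enumerate(zip(names, positions), start=1):
    conn.execute("INSERT INTO agents VALUES (?, ?)", (agent_id, name))
    conn.execute(
        "INSERT INTO contributions VALUES (1, ?, 1, ?)", (agent_id, position)
    )


def ordered_contributors(conn, ref_id, role):
    """Return contributor names for one reference and role, in list order."""
    rows = conn.execute(
        """
        SELECT a.name
        FROM contributions AS c
        JOIN agents AS a ON a.id = c.agent_id
        JOIN roles  AS r ON r.id = c.role_id
        WHERE c.ref_id = ? AND r.label = ?
        ORDER BY c.position
        """,
        (ref_id, role),
    )
    return [name for (name,) in rows]
```

Because order lives in a column rather than in storage order, the same query shape could select only positions 1 to 3 to pick out the primary authors.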

This has been the agonizing part of designing the new bibliographic ontology. We’ve adopted the normalized approach by adding an explicit Contribution class. The approach gives a whole lot of flexibility, and maps well to a relational database.

But for legacy data and such, I’d expect some developers might want to use the de-normalized approach above. Thankfully, one can always do both. Triples are pretty cheap, after all, and using one form does not negate the other.

I do wonder, though, if perhaps we need to distinguish among different kinds of contribution, so as to make it easier to scope positions within different lists (primary-contributions vs. secondary-contributions, etc.).

SPARQLBot

Posted in Technology on March 9th, 2008 by darcusb – Comments Off

I mentioned RDF and SPARQL in the previous post. I’ve been thinking about SPARQL again in part because of this really cool little IRC bot. From an IRC session:

<bdarcus>   sparqlbot, count graphs
<sparqlbot> bdarcus, I count 75 graphs.
<bdarcus>   sparqlbot, count triples
<sparqlbot> bdarcus, 19888 triples found
<bdarcus>   sparqlbot, load http://twitter.com/Scobleizer
<bdarcus>   sparqlbot, Scobleizer's contacts
<sparqlbot> 1132 triples loaded in 21.5 seconds
<sparqlbot> bdarcus, I found rael, Biz Stone, Evan Williams, sara, Andy Keep, Krissy, Philip Kaplan, veen, Jason Shellen, Sacca, Scott Fegette, Matt Galligan, Jerry Richardson, Mary Hodder, Brian Walsh, Clint G, Jim Williams, Paul Morriss, Ian Hay, Wayne Sutton, nanek, Ross, caroline, Hunter, Brad Barrish, necrodome, Mack D. Male, Nitin, om, steve epstein, Dav...

So basically, the natural language terms like “count” and “contacts” invoke specialized SPARQL queries, and return the results in natural language form. Really nice, and illustrates a world of possibility!
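I have not looked at how the bot is implemented, but the general idea can be sketched as a lookup table from phrases to canned queries. Everything below is hypothetical: the function name, the table, and the queries themselves, which use the COUNT aggregate from SPARQL 1.1.

```python
from typing import Optional

# Hypothetical phrase-to-query table; the bot's real queries are not shown
# in the session above. COUNT is a SPARQL 1.1 aggregate.
COMMANDS = {
    "count triples": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "count graphs": (
        "SELECT (COUNT(DISTINCT ?g) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } }"
    ),
}


def to_sparql(message: str, bot_name: str = "sparqlbot") -> Optional[str]:
    """Return the canned query for a message addressed to the bot, else None."""
    prefix = bot_name + ","
    if not message.lower().startswith(prefix):
        return None  # not addressed to the bot
    command = message[len(prefix):].strip().lower()
    return COMMANDS.get(command)
```

A phrase like “Scobleizer's contacts” would need a pattern with a slot for the name rather than a fixed key, which this sketch leaves out.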

Learning from the Tumblelog

Posted in Technology on March 9th, 2008 by darcusb – Comments Off

As I’ve been looking into revamping and expanding my personal website, I’ve been interested in the Tumblelog. A traditional weblog essentially has one main object: the post. A post is typically a chunk of text content, with an author, a title, and so forth. A blog is thus a collection of posts, ordered by date.

A Tumblelog breaks out of the single object box. In addition to the post, depending on implementation you can also have links, people, places, photos, music, and quotes. That content can in turn be assembled from other sources: Delicious feeds, Flickr photo sets, etc.

From this perspective, then, a Tumblelog allows one to weave together a range of different kinds of content. So the date-ordered list can include different kinds of objects, but these objects can also be woven together even within, say, a post.

So what lessons might this have for a scholar? What ideas might I steal from the Tumblelog, and how might I extend them?

I’d say the general approach goes really far. I think I would probably just get a little more generic. For example, a post and an article have little that distinguishes them, except the view. A draft manuscript isn’t conceptually any different than a draft blog post (unless you wanted to model sections). Notes are really just informal content, but still not really fundamentally different. Citations might be thought of as just a special kind of link.

So in the ideal CMS I am imagining, it would weave together links and associated metadata from Delicious and Zotero 2.0*, images from Flickr, and have a project view that allows me to group content and publications.

But what about the details? How to implement this?

In the world of Django, the approach seems to be to have different models for the different content, and then use a generic relation model to weave the content together easily. So, separate classes/tables for links, photos, quotes and so forth. This approach seems to work well for Jeff Croft, Wilson Miner, and Nathan Borror.

I have to say, though, that after dealing a lot with RDF, a relational database feels a little claustrophobic: having to define an entire model upfront, and to worry about the consequences of changes later. And while I love the automatic Django admin interface, I’m starting to wonder if it’s really worth all the hassle. For a personal site, it’s not like I’m creating and managing that much structured data.

On the other hand, the (currently PHP-based but soon to include Ruby) Chyrp project takes a more generic approach, where there is essentially a single object again, but this can be extended. This makes sense, since projects like Chyrp are designed both as dedicated tools and to be easily extended with plug-ins.

But given the straitjacket restrictions of a traditional relational database, exactly how can one store quotes, and events, and images all in the same table? In the current implementation, it seems that extended data is embedded as XML in the database. Ouch, this just feels wrong! Extended data becomes essentially a second-class citizen.

This seems a perfect place to borrow from RDF, either in whole, or in part. One approach would simply be to include an RDF store wholesale, as planned in Drupal. With an example like ARC, you can just have a few tables sit alongside the main application tables, and handle all the flexibility you want. If a plug-in developer wanted to add extended data, they could just register the common data in the post table, but then add the extended triples in the generic RDF tables. Since each post gets a URI, it’s easy to then merge the data.
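Here is a rough sketch of that layout using SQLite from the Python standard library. It is a deliberate simplification with a single generic triples table, not ARC’s actual schema, and the URIs and property names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Common data that every post has.
CREATE TABLE posts (uri TEXT PRIMARY KEY, title TEXT NOT NULL);
-- Generic store for whatever extended data a plug-in wants to add.
CREATE TABLE triples (
    subject   TEXT NOT NULL,
    predicate TEXT NOT NULL,
    object    TEXT NOT NULL
);
""")

POST = "http://example.org/posts/1"   # hypothetical post URI
VOCAB = "http://example.org/vocab#"   # hypothetical plug-in vocabulary

conn.execute("INSERT INTO posts VALUES (?, ?)", (POST, "A quote I liked"))
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (POST, VOCAB + "quoteText", "Triples are pretty cheap."),
        (POST, VOCAB + "quoteSource", "Author Lists"),
    ],
)


def load_post(conn, uri):
    """Merge a post's common columns with its extended triples, keyed by URI."""
    row = conn.execute(
        "SELECT title FROM posts WHERE uri = ?", (uri,)
    ).fetchone()
    if row is None:
        return None
    merged = {"uri": uri, "title": row[0]}
    for predicate, obj in conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ?", (uri,)
    ):
        # A property may repeat, so values are collected in lists.
        merged.setdefault(predicate, []).append(obj)
    return merged
```

A plug-in can add new properties without any schema change, and the extended data stays queryable instead of being buried in an XML blob.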

Of course, this raises the question: why not just go all RDF? If my project, publication, image, etc. metadata are all stored as RDF, then creating a Tumblelog could be a simple SPARQL query away.

I hope to figure this all out soon, as I really want to get this new website up and forget about it!

RDF in Drupal

Posted in Technology on March 6th, 2008 by darcusb – Comments Off

Wondering how RDF might enhance the traditional CMS? Take a look at this recounting (and linked screencast) of a recent keynote on adding semantic web support to Drupal.

Reuters and the Semantic Web

Posted in Technology on February 3rd, 2008 by darcusb – Comments Off

The idea of the Reuters Calais semantic web service is to upload free text content to a web service, and receive back that content enhanced with embedded RDF. So, for example, let’s say your content includes the fragment:

… it will be possible to exchange tolar banknotes (unlimited) and coins (until 2016) only at the Bank of Slovenia.

The service will recognize “Bank of Slovenia” and send back the RDF, complete with a URI for the resource in question:

<rdf:Description 
  rdf:about="http://d.opencalais.com/comphash-1/65c45759-512c-3044-a47f-f74d42f14f4e">
  <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Company"/>
  <c:name>Bank of Slovenia</c:name>
</rdf:Description>
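On the receiving end, pulling the recognized entities out of such a response takes only a few lines. This is a sketch using the Python standard library; the fragment above does not declare its namespaces, so I wrap it in an rdf:RDF element and assume a namespace URI for the c: prefix.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed namespace for the c: prefix; the fragment in the post omits it.
C_NS = "http://s.opencalais.com/1/pred/"

# The fragment from the post, wrapped so that it parses on its own.
RESPONSE = f"""\
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:c="{C_NS}">
  <rdf:Description
    rdf:about="http://d.opencalais.com/comphash-1/65c45759-512c-3044-a47f-f74d42f14f4e">
    <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Company"/>
    <c:name>Bank of Slovenia</c:name>
  </rdf:Description>
</rdf:RDF>
"""


def entities(rdf_xml: str):
    """Return the URI, type and name of each rdf:Description in the response."""
    root = ET.fromstring(rdf_xml)
    found = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        type_el = desc.find(f"{{{RDF_NS}}}type")
        name_el = desc.find(f"{{{C_NS}}}name")
        found.append({
            "uri": desc.get(f"{{{RDF_NS}}}about"),
            "type": None if type_el is None else type_el.get(f"{{{RDF_NS}}}resource"),
            "name": None if name_el is None else name_el.text,
        })
    return found
```

For anything beyond a quick look, a real RDF parser would be the safer choice, since RDF/XML allows several equivalent ways of writing the same triples.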

From what I can tell, the service can only recognize the objects of description; it can’t identify relations. But certainly this is a nice start, and even nicer to see it’s free, and that they’re putting up a $5000 bounty to encourage a practical implementation in WordPress. Would also be nice if they could explore sending back the content as RDFa-enhanced XHTML (or maybe OpenDocument 1.2 once it’s released and its metadata functionality is implemented).


Creative Commons License