The Semantic Web, RDF and Scholarly Metadata
Metadata is central to scholarly activity of all kinds. Whether it’s students working on term papers, or researchers writing books and articles, much of that work involves marshaling metadata towards a convincing argument.
And yet, as I have said before, I find working with metadata far more work than I wish it was. More importantly, it’s more work than it needs to be.
Consider the work I do just to be able to gather the metadata to format my citations:
- For some journals, go to site X and find articles I want to cite; download RIS data for each separate article, then use Bibutils to convert them to MODS.
- For other journals, go to site Y and find articles I want to cite; download Refer data (they don’t offer RIS) for each separate article, then use Bibutils to convert them to MODS.
- For most books, I can grab MODS data for them directly over SRU from the Library of Congress. Except, the data is often so bad for my purposes (missing name roles, spurious markup, etc.) that I often just create new MODS records by hand.
- For everything else (which is a lot in my case), hand create the MODS records, with a little help from emacs templates.
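To make the conversion steps above a little more concrete, here is a minimal sketch of just the first half of that job: reading RIS text into a neutral structure that could later be mapped to MODS. This is not Bibutils and not its algorithm; the function name `parse_ris` and the sample record are invented for illustration, and it assumes the common RIS line layout of a two-character tag, two spaces, a hyphen, and a value.

```python
import re
from collections import defaultdict

# Assumed RIS layout: "AU  - Doe, Jane" (two-character tag, two spaces,
# a hyphen, then the value). Real exports vary between sites.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  -\s?(.*)$")


def parse_ris(text):
    """Parse RIS text into a list of records (dicts mapping tag -> list of values).

    Hypothetical helper for illustration only; it is not part of Bibutils.
    """
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue  # skip blank or unrecognized lines
        tag, value = match.groups()
        if tag == "ER":  # "ER" marks the end of a record
            records.append(dict(current))
            current = defaultdict(list)
        else:
            # Tags such as AU may repeat, so every tag collects a list.
            current[tag].append(value.strip())
    return records


# An invented sample record, not real bibliographic data.
SAMPLE = """TY  - JOUR
AU  - Doe, Jane
AU  - Roe, Richard
TI  - An Invented Article Title
PY  - 2005
ER  - 
"""

records = parse_ris(SAMPLE)
print(records[0]["AU"])
```

Even this toy version hints at where the effort goes: each source format (RIS, Refer, MODS over SRU) needs its own reader before anything can be stored in one normalized collection.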
And in order to use this data, I need to store it in a central location: a bibliographic collection in my eXist XML DB. Never mind the months I spent writing code to be able to put it all to good use.
I sometimes feel like it’s more work to use “time-saving” web gateways than to just walk over to the library, pull the journals off the shelves, and hand write some notes on what I’m reading. Does it really need to be like this?
Having started to write this, I came across a video that is purportedly an Apple promotional video from the mid-1990s. It lays out a vision for what personal computing ought to look like in the future. Interestingly enough, the video does this by dramatizing scholarly workflow. A Berkeley professor—a geographer no less—enters his office and opens his computer. A talking head begins telling him his schedule for the day, which includes an afternoon lecture on Amazon deforestation. It seems the absent-minded professor had forgotten about the lecture, and so asks (verbally) the computer to pull up last year’s lecture notes. Not satisfied the information is sufficiently current, he asks his talking head assistant to find all recent related work. The assistant responds “only journal articles?” “Yes,” the professor responds.
I won’t recount the whole video, but suffice it to say that this is exactly the sort of seamless access to information that I’d like to see sometime before my career is over. And yet, we’re so far from being there that I often find myself frustrated.
It’s within this context that I observe a rather old debate playing out with respect to the library world. I suppose I started it by asking an innocent question of Kevin Clarke’s post on metadata interoperability: “what about RDF??”
From my understanding, the origins of RDF lie with work done at Apple during roughly the same time period as this video was put together. Indeed, the video is essentially all about a vision of a semantic web. It’s telling to me that Apple chose to dramatize that vision using a professor.
Yet when the subject of RDF is raised in the library community, in general the response is either silence, or outright hostility. I’ve yet to hear a single convincing argument why not RDF, and it bothers me that the design of library standards like MODS and MADS suggests that there has been no attempt to make them RDF compliant.
And yet there is some RDF-related movement in the bibliographic world, though most of it is spurred by people coming from outside the community. There is the SIMILE project at MIT, of course. And Leigh Dodds, who had some interesting things to say in response to Kevin, is heading up Ingenta’s quite ambitious move towards RDF. That started a while back when they began serving up PRISM RSS feeds of their journal holdings, but will deepen significantly as they move to an RDF backend.
All of a sudden I can start to imagine a different way. Instead of me having to maintain my own normalized metadata (which takes a LOT of work), why can’t I just create citations that point to resources in disparate locations on the web? Why can’t I have elegant search applications that can find me the information I need, and access its metadata, without me having to access 10 different sites, most of them poorly designed, and each with its own UI eccentricities? And for the RDF community, how about ditching BibTeX (with all of its significant problems) and adapting CiteProc to support an RDF-based approach, where one formats one’s LaTeX/DocBook/OpenOffice/Word documents using an elegant citation style language and distributed RDF metadata?
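As a rough illustration of what "citations that point to resources" could mean in practice, here is a small sketch that writes a description of an article, plus a citation link to it, as N-Triples. The property names (`title`, `creator`, `date`, `references`) come from the Dublin Core terms vocabulary; everything else, including the `example.org` URIs, the helper functions, and the sample values, is invented for the example and is not how any particular publisher exposes its data.

```python
# Dublin Core terms namespace; the properties used below are defined there.
DCTERMS = "http://purl.org/dc/terms/"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Format a string as a plain N-Triples literal.

    Only backslashes and double quotes are escaped here; a fuller
    implementation would also handle newlines and other control characters.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe(resource, properties):
    """Return one N-Triples line per (property name, literal value) pair."""
    return [
        f"{uri(resource)} {uri(DCTERMS + name)} {literal(value)} ."
        for name, value in properties
    ]


# Hypothetical URIs: the article lives on a publisher's site, my paper on mine.
ARTICLE = "http://example.org/journal/article/1"
MY_PAPER = "http://example.org/my-paper"

# Metadata the publisher could serve about its own article (invented values).
triples = describe(
    ARTICLE,
    [
        ("title", "An Invented Article Title"),
        ("creator", "Doe, Jane"),
        ("date", "2005"),
    ],
)

# My side of the bargain: a single statement that my paper cites that article.
citation = f"{uri(MY_PAPER)} {uri(DCTERMS + 'references')} {uri(ARTICLE)} ."

print("\n".join(triples + [citation]))
```

The point of the sketch is the division of labor: the descriptive triples stay with whoever publishes the article, and the only statement I would have to maintain myself is the last one.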
Really, we need a better way! Yeah, I know there are all kinds of institutional and financial and technical barriers to getting where we need to go, but we need to get there, and it seems to me RDF is a better solution than vanilla XML. As I said in a comment to Kevin’s post:
There are some serious metadata issues that the library world needs to grapple with over the coming years. I’m thinking not only about figuring out smooth ways to integrate disparate data, but also to begin to better put it in the framework of a larger view such as the FRBR. To do that well, I think the library world needs to do a much better job of interacting with other communities that are grappling with the same issues. I sadly don’t see even a hint that this is happening with RDF.
And conversely, I would add, the library world can still play an important role in revolutionizing the web more generally; in incrementally helping to realize at least some of the vision of the semantic web. What if, for example, the FRBR became broadly used and understood in the RDF world? What if library authority data became widely used and cited far beyond libraries?
For once, the library community is probably correct to be cool toward a technology. I respect Leigh Dodds a great deal, but RDF is a mess and its uptake has been minimal even among the techiest of techies.
There’s better out there. I’d rather see us working with topic maps, personally.
Hi Dorothea,
But how do topic maps solve the problem of metadata interoperability that concerns me and, I’m sure, many in the library world?
Not sure if you’re on the MODS list, but the other day I posted a MADS example that was recast as RDF. I would argue that not only was it now RDF compliant, but that it was also better, semantically cleaner, XML.
Still, just to push you a bit: what do you mean by “a mess”?
I’m not concerned with the question of uptake, because I’ve seen that argument used too many times to dismiss better solutions. Witness RELAX NG, which as far as I’m concerned is a far superior validation technology to XML Schema. And it seems to me RDF is being used more and more anyway.
Yes, yes and yes
It’s totally ironic to me that RDF is a success story from the library world (Eric Miller, then at OCLC, is the current lead for the W3C SemWeb efforts). It seems that there is a lot of negative energy (especially in the library community) when you start talking about the “semantic web”. And even outside of the introverted library world, seasoned semweb developers often find it hard going.
Keep up the good work Bruce.
Thanks for the thumbs up Ed.
There’s a chicken-and-egg problem here: people like the burned-out-RDFer you mention get frustrated because not enough of the big metadata providers (like libraries!) make their data available as RDF, and because people associated with those metadata worlds then dismiss RDF as empty hype.
And as for the “negative energy”: I sometimes think the library world is a little possessive of its data, and of its standards. But dammit, this is my data too! The semantic web is nothing but a vision of a functional metadata commons, which a) is something we don’t have now, b) I’m convinced is possible, and c) the library world SHOULD have a bigger role in!