Modeling References Relationally
Given that I’ve been doing a lot of work on reference metadata modeling over the past year, I’ve been trying to put that knowledge down in a formalized way wherever possible. For the most part, this has been in RELAX NG for XML representations like CSL, and the new RDF/XML representation I’ve recently been working on.
However, I’ve long been calling for better SQL models for this sort of metadata. So I thought it would make sense to tackle this now and release it to the world for anyone to use if I actually get anything useful done. Indeed, it might well fit with how I’ve been thinking about this from an RDF standpoint.

I ran into DB guru Josh Berkus on the OpenOffice DB dev list a while back, where he challenged my contention that SQL DBs are awkward ways to store reference metadata. Josh has since graciously been helping me work out the structure of the model.
The basic question, as Josh notes, is how to model parts (articles), containers (books and journals), and collections (series, archival collections, etc.). I have recently leaned toward the view that the basics here can be handled in a single table (title, date, description, etc.), where each level would be a separate row. Contributors, notes, and the like would be handled in separate tables.
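To make the single-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table names (resource, contributor, note), the column names, and especially the parent_id link between levels, which the description above does not specify. It is not the model Josh and I have settled on.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
# The parent_id self-reference is an assumed way of linking the levels.
SCHEMA = """
CREATE TABLE resource (
    id          INTEGER PRIMARY KEY,
    level       TEXT NOT NULL
                CHECK (level IN ('part', 'container', 'collection')),
    title       TEXT NOT NULL,
    date        TEXT,
    description TEXT,
    parent_id   INTEGER REFERENCES resource(id)
);
CREATE TABLE contributor (
    id          INTEGER PRIMARY KEY,
    resource_id INTEGER NOT NULL REFERENCES resource(id),
    name        TEXT NOT NULL,
    role        TEXT NOT NULL
);
CREATE TABLE note (
    id          INTEGER PRIMARY KEY,
    resource_id INTEGER NOT NULL REFERENCES resource(id),
    body        TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database holding one chapter, its book, and its series."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO resource (level, title) VALUES ('collection', 'Example Series')"
    )
    series_id = cur.lastrowid
    cur.execute(
        "INSERT INTO resource (level, title, date, parent_id) "
        "VALUES ('container', 'Example Book', '2005', ?)",
        (series_id,),
    )
    book_id = cur.lastrowid
    cur.execute(
        "INSERT INTO resource (level, title, parent_id) "
        "VALUES ('part', 'Example Chapter', ?)",
        (book_id,),
    )
    chapter_id = cur.lastrowid
    cur.execute(
        "INSERT INTO contributor (resource_id, name, role) "
        "VALUES (?, 'Jane Doe', 'author')",
        (chapter_id,),
    )
    conn.commit()
    return conn, chapter_id


def ancestry(conn, resource_id):
    """Follow parent_id links upward, returning (level, title) pairs."""
    rows = []
    current = resource_id
    while current is not None:
        level, title, parent = conn.execute(
            "SELECT level, title, parent_id FROM resource WHERE id = ?",
            (current,),
        ).fetchone()
        rows.append((level, title))
        current = parent
    return rows
```

Calling ancestry(conn, chapter_id) on the demo data walks from the chapter up through its book to its series, which is the information a formatted citation for the chapter would need.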
But not so fast, Josh says! A good RDBMS designer takes a rather different, more methodical approach to design than do people like me who come from an XML background. He doesn’t want me to worry about abstractions like parts and containers and collections. “Just tell me exactly what you need to store,” he asks.
This is a problem. There is only one spec I am aware of that has this information in a comprehensive way. The problem is that the spec is not easy to read, and its heavy focus on the level-based abstraction may lead DB designers down a wrong path. It is true the levels and relations are crucial, but it is also true that from the perspective of users, they do not care. Citations are about the reference; from that standpoint there is no distinction between an analytical (article) title, and a monographic (say report) title. They are both names for citable resources. If one searches on a title, one should get results from either level.
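The point about searching can be shown with a small sketch, again using sqlite3. The resource table, its columns, and the sample titles are all made up for illustration. The only claim is that when titles at every level live in one column, a single query finds them regardless of level.

```python
import sqlite3


def search_titles(conn, term):
    """Return (level, title) for every row whose title contains the term."""
    pattern = f"%{term}%"
    return conn.execute(
        "SELECT level, title FROM resource WHERE title LIKE ? ORDER BY id",
        (pattern,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Hypothetical, deliberately tiny table: one row per citable resource.
conn.execute(
    "CREATE TABLE resource ("
    "id INTEGER PRIMARY KEY, level TEXT NOT NULL, title TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO resource (level, title) VALUES (?, ?)",
    [
        ("analytical", "Modeling Water Rights"),   # an article title
        ("monographic", "Water Rights Report"),    # a report title
        ("monographic", "Unrelated Journal"),
    ],
)

# SQLite's LIKE is case-insensitive for ASCII text by default.
results = search_titles(conn, "water rights")
for level, title in results:
    print(level, title)
```

The user never has to say which level they mean; the level column is there for the formatter, not for the person searching.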
As Josh notes, there are time constraints here. While I can do this myself, I’m overextended already (as is he!), and the task of compiling a table for each of 40+ reference types with their specific attributes will take more time than I think I have. I did start a table in OmniOutliner a couple weeks ago; just haven’t found time to get back to it. If someone wants to help, let me know.
I can say that some of the more difficult areas of bibliographic metadata, if you really want to do it right (i.e. not just use natural language strings), are the following:
- dates: they come in different kinds (publication dates, event dates for hearings and conferences, decision dates for legal cases) and different forms (1999, “August 2000”, “September 21, 2001”, “Spring/Summer, 2002”, “June 1, 5-9, 1971”)
- names: again, different types (personal vs. organizational) and forms (think about the international problem of sort order and transliteration, but also even Western names like “J. Edgar Hoover” or “Alexander von Humboldt”)
These problems of course apply across the spectrum of XML, RDF, and SQL.
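As a rough illustration of what "not just natural language strings" might mean for the dates and names above, here is a sketch in Python. The class names, field names, and the choice to sort "von Humboldt" under "Humboldt" are all assumptions on my part, and the sketch does not attempt the harder forms such as “June 1, 5-9, 1971” or transliterated names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """A date that may be known only to the year, month, or season."""
    kind: str                      # e.g. 'publication', 'event', 'decision'
    year: int
    month: Optional[int] = None
    day: Optional[int] = None
    season: Optional[str] = None   # e.g. 'Spring/Summer'; kept as a label only

    def sort_key(self):
        # Missing parts sort before known parts within the same year.
        return (self.year, self.month or 0, self.day or 0)


@dataclass(frozen=True)
class Name:
    """A contributor name split into parts so it can be sorted and displayed."""
    kind: str                      # 'personal' or 'organizational'
    family: str                    # for organizations, the whole name goes here
    given: str = ""
    particle: str = ""             # e.g. 'von'

    def display(self):
        parts = (self.given, self.particle, self.family)
        return " ".join(p for p in parts if p)

    def sort_key(self):
        # Assumed convention: ignore the particle when sorting.
        return (self.family.lower(), self.given.lower())


dates = [
    PartialDate("publication", 2000, month=8),
    PartialDate("publication", 1999),
    PartialDate("publication", 2001, month=9, day=21),
    PartialDate("publication", 2002, season="Spring/Summer"),
]
names = [
    Name("personal", "Humboldt", given="Alexander", particle="von"),
    Name("personal", "Hoover", given="J. Edgar"),
]

sorted_years = [d.year for d in sorted(dates, key=PartialDate.sort_key)]
sorted_names = [n.display() for n in sorted(names, key=Name.sort_key)]
```

The same split into parts could be expressed as columns in SQL, elements in XML, or properties in RDF; the hard part is deciding on the parts, not the syntax.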