Innovation and Problems of Metadata Modeling

Someone pointed me to the really promising new Ruby on Rails application WEIRD.

The fantastically cool innovation in WEIRD is its annotation support. Rather than work with paper copies in which you write scribbles in the margin and underline key passages, and only later go back and type in notes in your bibliographic database, you do this directly in your browser. Highlight a passage, and the box magically appears in real-time on the screen, and gets stored in the database.

WEIRD makes use of a set of libraries that convert electronic formats like PDF into Flash, and then uses Flash to both display the pages in your browser, and to handle the user interaction that allows highlighting passages.

This is really cool stuff. I also like the attention to detail that allows one to mark a note as private (though I wish it had support for semantic wiki-like markup of annotation content, and the ability to export to LaTeX, DocBook, XHTML, etc.).

However, when dealing with applications for scholars, you need to start at the beginning and ask:

  1. who is your imagined user base?
  2. what kinds of documents do they want to store and annotate?
  3. what kinds of workflows make their work best, um, flow?

On these counts, I think WEIRD has a ways to go. As is far too typical in bibliographic application development, the imagined user base is a narrow one: for the most part people from hard science or technical fields. Nowhere does it make room for, say, law students wishing to work through case law, or historians who work with archival documents, or media studies people who want to annotate, say, photographs. Or, in my case, the scholar who wants to post chapters of a manuscript for comment.

To understand the issues, we again must look at the lowest levels of the data model. Consider this from the model/articles.rb file:

    class Article < ActiveRecord::Base
      # Relationships
      belongs_to :user
      has_many :comments
      has_many :annotations
    end

The database structure reflects this, with the following tables:

  1. annotations
  2. articles
  3. comments
  4. users

The bibliographic metadata itself is stored as a single string; as BibTeX! So once again we have YABA, with all of the baggage that goes along with relying on a horribly broken data model.
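
To illustrate why a single string is so limiting, here is a rough sketch in plain Ruby. The rows, the BibTeX entries, and the helper names (`bibtex_field`, `articles_by`) are hypothetical and are not taken from the WEIRD code; the point is only that every query has to re-parse every string.

```ruby
# Hypothetical rows: each article carries its metadata as one BibTeX string.
ARTICLES = [
  { id: 1, bibtex: "@article{doe2004, author = {Doe, Jane and Roe, Richard}, " \
                   "journal = {Journal of Examples}, year = {2004}}" },
  { id: 2, bibtex: "@article{roe2005, author = {Roe, Richard}, " \
                   "journal = {Annals of Samples}, year = {2005}}" }
].freeze

# Naive field extraction. It only handles simple {braced} values and gives up
# on nested braces or quoted values, which real BibTeX allows.
def bibtex_field(bibtex, name)
  bibtex[/\b#{Regexp.escape(name)}\s*=\s*\{([^{}]*)\}/i, 1]
end

# "Find everything by this author" means scanning and parsing every row.
def articles_by(author)
  ARTICLES.select do |row|
    bibtex_field(row[:bibtex], "author").to_s.split(/\s+and\s+/).include?(author)
  end
end

p articles_by("Roe, Richard").map { |row| row[:id] }
```

Nothing here knows that two articles share a journal or an author; those are just substrings that happen to match.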

A few months ago I sat down and tried to figure out a minimal SQL schema I felt would be simple enough to implement in these sorts of applications, but general enough that I could actually use the application that resulted. Here are the tables I came up with:

  1. agents (person or organization)
  2. bibitem
  3. contributor (joins an agent to a bibitem)
  4. relations (joins bibitem-bibitem; article to journal, chapter to book, etc.)
  5. notes (perhaps distinguishing annotations from comments makes more sense in a WEIRD context)
  6. users
  7. topics

So a journal article would be a bibitem with a contributor with role of author and a genre of article, which has a relation of isPartOf to another bibitem with genre of academic journal. Most of the citation-specific metadata like volume, issue and page numbers would be stored in separate rows in the main bibitem level.
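
To make the journal article case concrete, here is a rough sketch that uses plain Ruby Structs as stand-ins for the tables rather than real SQL. The field names, the sample records, and the helpers `container_of` and `authors_of` are placeholders of my own choosing, not a finished schema.

```ruby
# Stand-ins for four of the tables; every field name is an assumption.
Agent       = Struct.new(:id, :name, :kind)             # kind: :person or :organization
Bibitem     = Struct.new(:id, :title, :genre, :details) # details: volume, issue, pages
Contributor = Struct.new(:agent_id, :bibitem_id, :role) # joins an agent to a bibitem
Relation    = Struct.new(:from_id, :predicate, :to_id)  # joins a bibitem to a bibitem

AGENTS = [Agent.new(1, "Doe, Jane", :person)].freeze
BIBITEMS = [
  Bibitem.new(1, "An Example Article", :article,
              { volume: "12", issue: "3", pages: "45-67" }),
  Bibitem.new(2, "Journal of Examples", :academic_journal, {})
].freeze
CONTRIBUTORS = [Contributor.new(1, 1, :author)].freeze
RELATIONS    = [Relation.new(1, :isPartOf, 2)].freeze

# Follow the isPartOf relation from an item to the item that contains it.
def container_of(item)
  rel = RELATIONS.find { |r| r.from_id == item.id && r.predicate == :isPartOf }
  rel && BIBITEMS.find { |b| b.id == rel.to_id }
end

# Collect the agents attached to an item with the role of author.
def authors_of(item)
  CONTRIBUTORS
    .select { |c| c.bibitem_id == item.id && c.role == :author }
    .map { |c| AGENTS.find { |a| a.id == c.agent_id } }
end

article = BIBITEMS.first
journal = container_of(article)
puts "#{authors_of(article).map(&:name).join('; ')}. #{article.title}. " \
     "#{journal.title} #{article.details[:volume]}(#{article.details[:issue]})"
```

The same two helpers would work unchanged for a chapter in a book, since only the genre values and the records differ.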

This is, mind you, a minimal acceptable schema from my standpoint. A more ambitious model would be based more closely on FRBR, as LibDB is, and would result in an additional three tables, with what is bibitem above broken down into:

  1. work
  2. expression
  3. manifestation
  4. item
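
As a rough sketch of how those four levels chain together, again in plain Ruby Structs rather than SQL: a work is realized in expressions (say, the original text and a translation), each expression is embodied in manifestations (particular editions), and an item is a single copy. The field names, sample records, and helper names below are assumptions for illustration only.

```ruby
# Each level points at the level above it; field names are assumptions.
Work          = Struct.new(:id, :title)
Expression    = Struct.new(:id, :work_id, :language)
Manifestation = Struct.new(:id, :expression_id, :publisher, :year)
Item          = Struct.new(:id, :manifestation_id, :location)

WORKS = [Work.new(1, "An Example Monograph")].freeze
EXPRESSIONS = [
  Expression.new(1, 1, "en"),
  Expression.new(2, 1, "de")
].freeze
MANIFESTATIONS = [
  Manifestation.new(1, 1, "Example Press", 2001),
  Manifestation.new(2, 2, "Beispiel Verlag", 2003)
].freeze
ITEMS = [Item.new(1, 2, "my shelf")].freeze

# Walk up from a single copy to the abstract work it belongs to.
def work_for(item)
  manifestation = MANIFESTATIONS.find { |m| m.id == item.manifestation_id }
  expression    = EXPRESSIONS.find { |e| e.id == manifestation.expression_id }
  WORKS.find { |w| w.id == expression.work_id }
end

# Walk down from a work to every edition of every expression.
def manifestations_of(work)
  expression_ids = EXPRESSIONS.select { |e| e.work_id == work.id }.map(&:id)
  MANIFESTATIONS.select { |m| expression_ids.include?(m.expression_id) }
end

puts work_for(ITEMS.first).title
puts manifestations_of(WORKS.first).map(&:publisher).join(", ")
```

The payoff is that an annotation on my copy of the German edition can still be found from the work as a whole.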

I’m personally not convinced an XML DB like eXist or Berkeley DB XML isn’t a better approach to storing and querying bibliographic metadata and annotations, but if people must use an RDBMS, they really need to sit down and think carefully about the data model, and how best to exploit the unique strengths of the storage technology. Bibliographic data is NOT simple, and basing an application on BibTeX is a sure way to limit how broadly it might be used!

5 Comments

  1. Hi Bruce,

    I agree with you fully (at least in theory ;-). I think that splitting host (i.e. book/journal/series) information (like publisher, place, issn/isbn, host editor, language, host URL/DOI, etc) from a bibitem is feasible and desirable.

    However, it gets WAY more complicated when dealing with authors which would require an “authority” system. I haven’t taken a closer look at LibDB but I imagine that a good (and fool-proof!) user GUI that takes care of catching the right author from an author list is pretty complicated to implement. And what if the author isn’t present in the authority list? How is this handled in the library world? Who decides about the correct authority ID for a “new” author?

    Thanks, Matthias

  2. Bruce says:

    Hi Matthias, you’re right about the difficulty of normalizing names in particular. Still, the work has to be done somewhere; either on the front-end or the backend.

    Indeed, I just spent HOURS editing MODS records that were originally created in Endnote. The problem was not so much names, as basic structural issues related to how to represent archival records. When I entered them in Endnote way back, the app didn’t make it easy to deal with these records, so I entered the data inconsistently, and in some cases just wrong.

    So now to format my book, I have to go back through by hand and correct each of these records.

    Good metadata is difficult to create, but it does pay off I think, in more functional applications (in my case, the formatting end).

    I do think xml-httprequest autocompletion could help a lot with an entry GUI. So user starts to type a name, and existing db names are presented as completion options. Each name gets its own form field.

    As for authority control in the library world, my sense is that the name is telling: trained catalogers make the decisions, based on detailed rule books.

    When MADS starts to be widely deployed, this might pay off for us too.

  3. Bruce, thanks for your notes!

    “Each name gets its own form field”

    Yeah, that would ease things from a developer point of view. But now imagine a user inputting a record with MANY authors (worst case being articles like “The Sequence of the Human Genome” published in Science 2001 with 274 authors!). How would you deal with things like that? Certainly, an “Add Author” button wouldn’t do. While this example is somewhat exaggerated, I think it gets to the point. The user will suffer from any non-perfect implementation of an authority system.

    Plus, there has to be a mechanism to store the order of authors for a given bibitem.

    “As for authority control in the library world, my sense is that the name is telling: trained catalogers make the decisions, based on detailed rule books.”

    Ok, but now imagine a small scientific institute that has set up a bibliographic web database to manage their references and citations. A user wants to input a new record with an author who’s not in any authority list provided by the database interface. What is he supposed to do? For sure, he won’t be able to consult a professional cataloger to add this author for him. I don’t know anything about MADS (except the name and its intended purpose) but how would this help my user with a problem similar to the one outlined above?

    Thanks again, Matthias

  4. Bruce says:

    You raise good questions Matthias, which partly reflect the different needs of different fields. I was just reading an MA thesis proposal in which the student was citing an article with like 20 authors, but he hadn’t learned the magic of “et al”!

    On the 20-author problem in a GUI: I agree my solution could be a problem, but it could be mitigated by an efficient way to suck in metadata from elsewhere such that a user didn’t have to manually enter the reference data.

    Almost all articles published in the last 5 or 10 years have online bibtex/ris/endnote/pubmed data that at least provides a good starting point that one can fine-tune. I seldom create my own MODS records for journal articles.

    For authority control, you just try to enforce some rules and hope people follow them.

    RefDB, BTW, normalizes names in the database. So it is possible, even if it doesn’t (yet) have a GUI interface.

    MADS can help because I can imagine a future in which one has web service access to XML authority files.

  5. “On the 20-author problem in a GUI: I agree my solution could be a problem, but it could be mitigated by an efficient way to suck in metadata from elsewhere such that a user didn’t have to manually enter the reference data.”

    Yes, that would be a viable option. But if data are imported from an external source, who is going to decide to which authority entry a particular author belongs? How can the software tell that the imported author “Steffens, M” is “Steffens, Matthias” from Kiel, Germany, and not “Steffens, Michael” from Arizona, USA, or (even worse), not “Steffens, Matthias” from Berlin, Germany?

    I think the only possible way to resolve this problem is that content publishers like Nature/Science/Pubmed/etc provide the correct authority information along with the bibliographic data. Then importing/mapping is a snap. This would also remove the burden from the end-user or from individual database applications (except for providing support for an appropriate mapping mechanism).

    Matthias
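
The name problem from the thread above can be sketched in a few lines of plain Ruby. This is only an illustration: the authority list reuses the three names from the last comment, while the ids, the Struct, and the helpers `complete` and `resolve` are made up for the example. `complete` is the kind of prefix match an autocompleting entry form might run; `resolve` shows why an imported name alone is often not enough.

```ruby
Authority = Struct.new(:id, :family, :given, :place)

# Hypothetical authority list; the ids are invented for this example.
AUTHORITIES = [
  Authority.new("a1", "Steffens", "Matthias", "Kiel"),
  Authority.new("a2", "Steffens", "Michael",  "Arizona"),
  Authority.new("a3", "Steffens", "Matthias", "Berlin")
].freeze

# Completion options for a partly typed "Family, Given" string.
def complete(typed)
  family, given = typed.split(",", 2).map(&:strip)
  AUTHORITIES.select do |a|
    a.family.downcase.start_with?(family.to_s.downcase) &&
      a.given.downcase.start_with?(given.to_s.downcase.delete("."))
  end
end

# Resolve an imported name. With an authority id from the publisher the
# mapping is a direct lookup; without one, only an unambiguous name resolves.
def resolve(name, authority_id: nil)
  return AUTHORITIES.find { |a| a.id == authority_id } if authority_id

  matches = complete(name)
  matches.size == 1 ? matches.first : nil
end

p complete("Steffens, M").map(&:place)
p resolve("Steffens, Matthias")
p resolve("Steffens, M", authority_id: "a1").place
```

Even the full name "Steffens, Matthias" stays ambiguous here, so without an id supplied along with the data, the choice falls back to a person.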


Creative Commons License