Names and Dates
In playing with Zotero over the last few days I’m reminded that the two most difficult things to correctly handle in bibliographic databases and the GUIs built on top of them are names and dates.
Consider the following names:
1. John van Doe, III
2. Mao Zedong
3. The Rolling Stones
4. Prince
5. Senate Committee on Trees, Plants, and Flowers
A database needs to store these names in ways that make it possible to reliably sort and (re)format them.
Name 1 is a more standard Western personal name, with two little twists: a particle (“van”), which may or may not be included in sorting depending on locale, and a suffix (“III”). By convention, bibliographic software developed in North America or Western Europe assumes these sorts of names, along with a very particular notion of the relation between display and sorting. So we have things like first name (secondary sort key) and last name (primary sort key).
But name 2 points out one problem with this: not all languages have the same sorting conventions. For the name “Mao Zedong” (a transliterated Mandarin name) you sort on “Mao.” This is actually easier in many ways, since sort and display order are equivalent, but not if you assume first/last names. Yes, “Mao” was his “first” name, but not at all the same kind of first name as mine.
Names 3 and 4 also throw a wrench in standard expectations; the first is a group (an organization is another sort of group), and the second a pseudonym. Name 5 shows that with group or organizational names, you have to ignore standard delimiters like commas.
So if you have fields like first and last name, you’re already severely limiting what kind of data can be stored. If, on the other hand, you have just a single field for names (or for dates), you have to be really careful to make it clear to users how they should enter their data.
My preference on names would be a single field with a GUI hint on how to enter (as sort order, so “Doe, Jane” or “Mao Zedong”), a checkbox to indicate a group (to switch off parsing), and then a tooltip that showed the display name (how the software is parsing the name string). That seems to give the best balance of structure and flexibility.
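To make that idea concrete, here is a minimal sketch in Python. It assumes the name is stored exactly as entered (in sort order), that a comma separates family name, given name, and any suffix, and that the group flag switches off parsing entirely. The class and method names are hypothetical, and the sketch deliberately ignores the locale question of whether a particle like “van” should count for sorting.

```python
from dataclasses import dataclass


@dataclass
class Name:
    """A single-field name, stored as the user typed it (in sort order)."""
    raw: str                # e.g. "Doe, Jane" or "Mao Zedong"
    is_group: bool = False  # the "group" checkbox: switches off parsing

    def sort_key(self) -> str:
        # Sort order is simply the entered string, compared case-insensitively.
        return self.raw.casefold()

    def display(self) -> str:
        # Groups, and names entered without a comma, display exactly as entered.
        if self.is_group or "," not in self.raw:
            return self.raw
        # Assumed convention: "Family, Given[, Suffix...]".
        family, given, *suffix = [part.strip() for part in self.raw.split(",")]
        shown = f"{given} {family}" if given else family
        if suffix:
            shown += ", " + ", ".join(suffix)
        return shown


names = [
    Name("van Doe, John, III"),
    Name("Mao Zedong"),
    Name("The Rolling Stones", is_group=True),
    Name("Prince"),
    Name("Senate Committee on Trees, Plants, and Flowers", is_group=True),
]

for name in sorted(names, key=Name.sort_key):
    print(f"{name.raw!r} displays as {name.display()!r}")
```

The value returned by `display()` is what the tooltip would show, so a user who typed “Doe, Jane” could immediately see that the software reads it as “Jane Doe.”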
Dates are also problematic in quite similar ways, because they just don’t fit the neat boxes of standard datatypes. Consider:
- November/December 2000
- Spring 2001
- Second Quarter, 2002
- c. 200 BC
Here I’d prefer four separate fields: year, month, day, other. This is how RIS handles dates, and it seems the best balance.
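A rough sketch of the four-field approach, in Python. The class name, the choice to sort missing parts as zero, and the use of a negative integer for a BC year are all assumptions made for illustration. The slash-separated output follows the year/month/day/other layout as I understand RIS to use it, so treat `to_ris_like` as an approximation rather than a reference implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BiblioDate:
    """A date split into year, month, day, and free-text 'other'."""
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    other: str = ""  # whatever does not fit: "Spring", "Second Quarter", "c. 200 BC"

    def sort_key(self) -> Tuple[int, int, int]:
        # Missing parts sort as 0 (an assumption, not a standard).
        return (self.year or 0, self.month or 0, self.day or 0)

    def to_ris_like(self) -> str:
        # Empty slots stay empty so the four positions are always present.
        def part(value: Optional[int], width: int) -> str:
            return "" if value is None else f"{value:0{width}d}"

        return "/".join(
            [part(self.year, 4), part(self.month, 2), part(self.day, 2), self.other]
        )


dates = [
    BiblioDate(2000, 11, other="November/December"),
    BiblioDate(2001, other="Spring"),
    BiblioDate(2002, other="Second Quarter"),
    BiblioDate(-200, other="c. 200 BC"),  # negative year for BC: an assumption
]

for date in sorted(dates, key=BiblioDate.sort_key):
    print(date.other, "->", date.sort_key())

print(BiblioDate(2001, other="Spring").to_ris_like())
```

The structured fields carry sorting, while “other” preserves whatever the user actually meant, which is the balance the four-field design is after.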
Alternatively, I could imagine a single field, though parsing it might be a little tricky.
[...] A solution to my discussion of the name problem in citation metadata, in Ruby code: [...]
[...] From Ed Summers, news of a new terse metadata format. If you want a really powerful compact metadata syntax and model, I’d say go for RDF N3. But this is a nice and even simpler alternative, whose authors have clearly thought about the hard stuff, like names and dates. Hopefully some of this gets folded back into DC proper. [...]