Citation IDs

It’s clear the future of bibliographic and citation management is greater interoperability and collaboration, not less. In the future, users will create less bibliographic data and consume more of it.

For individual document authors, this raises an issue: how to code citations in such a way that they clearly and unambiguously point to the correct record? In a single-user context, this problem has generally been solved in one of two ways:

  1. a numeric database id
  2. a natural language citation key (e.g. Doe99a)

The first approach—used by default in Endnote—has the virtue of uniqueness within a single-user context. Beyond that, however, documents break. An ID of 2312 will point to one record on User A’s system, and another record entirely on User B’s.

I prefer the second approach myself, because the identifier is tied to the content, not the storage. I can look at the citation and deduce the record it refers to. However, the citekey approach has its own problems when you start to scale it from the desktop to the internet. How does one ensure that cite{Smith99} is unique?

So, question: what should be a standardized way to identify citation records that best balances these needs? There’s a discussion of this over at the BibDesk wiki. I tend to like the approach they note from CiteSeer, which concatenates the author name, two-digit year, and the title. Example: authorYEARtitle (mccracken03greatPaperAboutStuff). While it is a bit verbose, it seems to be the best approach to me. It is superior to the tradition (which I use!) of appending a suffix to multiple author-year combinations (e.g. Doe1999b) because:

  • more portable
  • more information rich (you can see at a glance the specific record it points to)
  • it doesn’t add punctuation like colons (that could cause problems in some contexts?)
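
To make the authorYEARtitle idea concrete, here is a minimal Python sketch of one way such a key could be built. The function name make_citekey, the camel-casing of title words, and the accent folding are assumptions of this sketch, not rules taken from CiteSeer or the BibDesk wiki.

```python
import re
import unicodedata


def _ascii_words(text):
    """Split text into ASCII alphanumeric words, dropping accents and punctuation."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.findall(r"[A-Za-z0-9]+", folded)


def make_citekey(author_surname, year, title):
    """Build an authorYEARtitle key (hypothetical helper, not a CiteSeer API)."""
    author = "".join(_ascii_words(author_surname)).lower()
    two_digit_year = f"{year % 100:02d}"
    words = _ascii_words(title)
    # First title word in lower case, the rest capitalized (assumed convention).
    camel = ""
    if words:
        camel = words[0].lower() + "".join(w.capitalize() for w in words[1:])
    return f"{author}{two_digit_year}{camel}"


print(make_citekey("McCracken", 2003, "Great Paper About Stuff"))
# prints mccracken03greatPaperAboutStuff
```

Because the key contains no punctuation, it can be dropped into most citation syntaxes unchanged; the price is length.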

Thoughts?

16 Comments

  1. David Wilson says:

    There could still be a problem with the key not being unique. Two possible, if not likely, scenarios: a paperback and hardcover edition of the same book, in the same year and with different page numbers; articles in two different journals with the same name. The method will have to avoid adding arbitrary numbers or letters to make the key unique, otherwise you may as well have doeJ:2001h. Perhaps you permit a larger number of attributes, i.e. publisher, media etc., which could be added to make an otherwise duplicate record unique. Also, how about using the ISBN, if it is available, as part of the key?

  2. Dave Howorth says:

    Database IDs only break if they’re designed to. It is trivial to make keys globally unique (GUID, UUID or whatever), which eliminates the ambiguity (i.e. one ID never refers to two documents) but still leaves the question of whether two IDs refer to different documents or the same one.

    Some global ID systems are designed to overcome this, such as DOIs, and these would be good where applicable but they don’t cover every document.

    Author and date, even combined with title, are not unique. There can be drafts, versions in two publications, revised versions etc. Whilst it is possible to identify these when one is aware of the need in a particular case, you can’t be sure everybody else will always know; indeed you may not know of other versions when you make the citation and therefore be unable to disambiguate it.

    For documents that are online, a URI to a ‘permanent’ copy together with an MD5 signature of the document is perhaps as good as you can do.
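
    The URI-plus-signature idea can be sketched in a few lines of Python. This is only an illustration: the example URI and the function names fingerprint and matches are hypothetical, and the document bytes are assumed to have been fetched already.

```python
import hashlib


def fingerprint(uri, content):
    """Pair a 'permanent' URI with the MD5 digest of the document bytes."""
    return {"uri": uri, "md5": hashlib.md5(content).hexdigest()}


def matches(citation, content):
    """Check whether some bytes are the exact document the citation refers to."""
    return hashlib.md5(content).hexdigest() == citation["md5"]


# Hypothetical URI and content, for illustration only.
document = b"The full text of the cited document."
citation = fingerprint("http://example.org/papers/doe1999.pdf", document)

print(citation["uri"], citation["md5"])
print(matches(citation, document))            # same bytes
print(matches(citation, document + b" v2"))   # revised version
```

    Any change to the bytes, including a harmless reformatting, changes the signature, so this identifies one exact copy rather than "the work".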

    For offline documents, either scan them in :) or refer to a particular copy (British Library, London or whatever)

    Cheers, Dave

  3. It might be worth looking at the file bibshare that comes with some TeX distributions. It proposes a way to construct unambiguous BibTeX citation keys based on the bibliographic reference (rather than the title). I quite like the idea, and use a variant of it for my own XML-based bibliographic database.

  4. Bruce says:

    So it seems there are approaches that are aimed at authors who like to work with and see their code: those (like me) who write in text editors. From this perspective, it’s important that the key be both easy to remember and easy to subsequently interpret. The standard bibtex citekey does this.

    DOIs are the other end of the spectrum; totally opaque to any human reader, but perfectly precise for machines. If everything had a DOI, I’d be inclined to say it might be worth using. Unfortunately, that won’t happen.

    The scheme that Justus points out is a hybrid: it is human readable, but has more specific data that addresses some of the problems of titles that Dave notes. So a journal article becomes “Gilchrist:NAMS-36-9-1199”, where you have author name, an abbreviation for the journal name, and then volume, issue, and page numbers.

    A book (let’s say monograph to be a little broader) includes a title abbreviation, with up to four letters drawn from words other than A, The, etc. So, for a book entitled “The End of the World” by John Doe in 1999, you’d get “Doe:EW-99”.

    I sort of like this too. I still think year should be there for consistency; e.g. “Gilchrist:99-NAMS-36-9-1199”.
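
    A rough Python sketch of keys in this style follows. It only reproduces the two examples given above and is not the actual bibshare specification; the stop-word list and the function names are assumptions.

```python
# Assumed stop-word list; the real scheme may skip a different set of words.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "to"}


def title_abbrev(title, max_letters=4):
    """Initials of up to max_letters significant title words."""
    initials = [
        word[0].upper()
        for word in title.split()
        if word.lower() not in STOPWORDS and word[0].isalpha()
    ]
    return "".join(initials[:max_letters])


def article_key(surname, journal_abbrev, volume, issue, first_page):
    """Author, journal abbreviation, then volume, issue and page."""
    return f"{surname}:{journal_abbrev}-{volume}-{issue}-{first_page}"


def book_key(surname, title, year):
    """Author, title abbreviation, then two-digit year."""
    return f"{surname}:{title_abbrev(title)}-{year % 100:02d}"


print(article_key("Gilchrist", "NAMS", 36, 9, 1199))   # prints Gilchrist:NAMS-36-9-1199
print(book_key("Doe", "The End of the World", 1999))   # prints Doe:EW-99
```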

    Hmm … I’m starting to think I may need to bite the bullet soon and work out how to modify my database to be more forward looking (probably with an XQuery script). Ugh … perhaps after I finish the book!

  5. Richard Karnesky says:

    If everything had a DOI, I’d be inclined to say it might be worth using. Unfortunately, that won’t happen.
    Maybe we should use it when things do have them. Why make yet another unique ID? Most things have some unique ID, be it DOI, LoC, ISBN, etc. You could even prefix them with which scheme of unique ID is being used. The references which haven’t been assigned a unique ID can use a scheme like the one described as a fall-back.

    There should be central databases matching the ID to the reference. DOIs and ISBNs already have this. Any user-created scheme would need a database. If we do this without the support of major libraries, the existing public and commercial bibliography database makers, etc., there is a good argument to leave most of the work to the already-existing DOI, etc. servers out there.

    As far as being easy-to-remember, the user’s needs should triumph. It won’t be needed for word processors at all, but there is an advantage of allowing people to cite in whichever way they find convenient–many already have easy to remember/type schemes. Simple scripts could be used to change the user-ID to the somewhat-standardized global ID given above. There is no way I’d be able to remember or quickly type “Gilchrist:NAMS-36-9-1199” better than “Gilchrist99” or even “Gilchrist99c.”
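
    As an illustration of such a script, here is a minimal Python sketch that swaps personal keys for global ones inside LaTeX-style \cite{...} commands. The alias table is hypothetical, and keys without an alias are left untouched.

```python
import re

# Hypothetical personal-key to global-key table.
ALIASES = {"Gilchrist99": "Gilchrist:NAMS-36-9-1199"}

# Matches \cite{key1, key2, ...}; other citation commands are ignored.
CITE = re.compile(r"\\cite\{([^}]*)\}")


def globalize(text, aliases):
    """Rewrite every \\cite{...} so that known personal keys become global keys."""

    def swap(match):
        keys = [key.strip() for key in match.group(1).split(",")]
        return "\\cite{" + ",".join(aliases.get(key, key) for key in keys) + "}"

    return CITE.sub(swap, text)


print(globalize(r"As shown in \cite{Gilchrist99, Doe99a}.", ALIASES))
```

    The author keeps typing the short key; the conversion happens once, when the document is shared.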

  6. Bruce says:

    Hmm … I wonder if it might make sense to consider making use of the new info uri scheme? In that scheme, a doi would be represented like so: info:doi/10.1000.10/123456. Likewise, a LoC catalog number would look like “info:lccn/n78089035”.

    How’s that?

    I’m starting to think citation keys and numeric identifiers should be separately coded in records.
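
    A small Python sketch of what that separation might look like: each record carries a short personal citekey plus a dictionary of standard identifiers, and an info URI is derived from whichever identifier is present. The Record class, the field names, and the preference order are assumptions; only the two namespaces mentioned above are handled, and no normalization or escaping of the values is attempted.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    citekey: str                                      # short key typed by the author
    identifiers: dict = field(default_factory=dict)   # e.g. {"doi": "...", "lccn": "..."}


# Assumed preference order; only the namespaces named in this thread.
INFO_NAMESPACES = ("doi", "lccn")


def info_uri(record):
    """Return an info URI for the first known identifier, or None if there is none."""
    for namespace in INFO_NAMESPACES:
        value = record.identifiers.get(namespace)
        if value:
            return f"info:{namespace}/{value}"
    return None  # caller falls back to the citekey or another scheme


article = Record("Doe99a", {"doi": "10.1000.10/123456"})
name = Record("Smith99", {"lccn": "n78089035"})
draft = Record("Doe04draft")

for record in (article, name, draft):
    print(record.citekey, info_uri(record))
```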

  7. Bruce says:

    I got this suggestion for serials (magazines, journals, etc.). It uses a combination of ISSNs and other ids.

  8. Joe Lovick says:

    Ok, so i may be a little late coming to this page, and nobody knows who i am, but here’s my 2c. The ID should be short so it can be typed by hand (up to 8 random chars, then some pattern). it should also be unique, and follow a pattern….. so could we have a user understandable hashing system to generate the ID, based on author name, title, language, publisher and date of publication? (sounds a bit like the holy grail to me)

    but how about… hashed(author, title), 3 (or 2) letter code for language, code for publisher (“go on, set some standards”) and 4 digit year.

    now i can’t remember my comp-science, but there should be a minimum size for a hash that allows, say, 1k author + title combinations, using only alphabetical characters to encode the result. but let’s say it is 6 chars (wild guess based on md5 lengths), so a code would read ASEFGAENGJLP1999

    where ASEFGA is the hashed part.. ENG = English and JLP = Joe Lovick Publishing, 1999 = the year.
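
    A minimal Python sketch of this layout, under stated assumptions: MD5 is used as the underlying hash, the digest is reduced to six letters by repeated division by 26, and the language and publisher codes are simply passed in, since no standard list of them exists here. The function names are hypothetical.

```python
import hashlib
import string


def letter_hash(author, title, length=6):
    """Reduce an MD5 digest of author and title to `length` uppercase letters."""
    text = f"{author.lower()}|{title.lower()}"
    number = int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")
    letters = []
    for _ in range(length):
        number, remainder = divmod(number, 26)
        letters.append(string.ascii_uppercase[remainder])
    return "".join(letters)


def hashed_id(author, title, language, publisher_code, year):
    """Hash part, then language code, publisher code and four-digit year."""
    return f"{letter_hash(author, title)}{language.upper()}{publisher_code.upper()}{year:04d}"


print(hashed_id("Lovick", "An Example Title", "eng", "jlp", 1999))
```

    Six letters give 26**6 = 308,915,776 possible values, so accidental collisions among a thousand or so entries should be rare, though nothing in the scheme rules them out.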

    If a suitable host could be found, then a web lookup service similar to freedb.org could be used to complete / check any hashes/publishers against known documents.

    also macros could easily translate to and from the hashed field into a more readable form,

    anyway this is just my 2c,

  9. Erik Wilde says:

    hi there.

    we are the sharef folks (http://dret.net/projects/sharef/), and since we strongly support personal bibliographies and are deeply into xml, our approach is to use xml namespaces and qualified names. i know that xml namespaces are ugly, but they are also very useful, and they happen to be the ideal vehicle. any reference is qualified by a namespace name, which identifies the person creating the reference, and the reference itself, which identifies the reference according to the person’s referencing scheme. this way, nobody has to be forced to use someone else’s referencing scheme, and everybody will be happy. at least that’s what we hope will happen ;-)

    cheers,

    dret.
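
    The qualified-name idea can be illustrated outside XML as well. In this Python sketch a reference is a (namespace, local key) pair, and a prefix such as alice: is expanded to a namespace URI the same way an XML prefix would be. The namespace URIs, the prefixes, and the names QualifiedRef and expand are hypothetical and are not taken from sharef.

```python
from typing import NamedTuple


class QualifiedRef(NamedTuple):
    namespace: str   # identifies the person (or group) who created the reference
    local_key: str   # the key in that person's own referencing scheme


# Hypothetical prefix-to-namespace bindings, like xmlns declarations.
PREFIXES = {
    "alice": "http://example.org/bib/alice",
    "bob": "http://example.org/bib/bob",
}


def expand(qname, prefixes):
    """Turn 'prefix:key' into a QualifiedRef, splitting on the first colon."""
    prefix, separator, local_key = qname.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"unbound or missing prefix in {qname!r}")
    return QualifiedRef(prefixes[prefix], local_key)


# Both authors use the key Doe99, but the qualified references stay distinct.
print(expand("alice:Doe99", PREFIXES))
print(expand("bob:Doe99", PREFIXES))
```

    Note that this keeps keys from colliding, but it does not by itself say whether the two Doe99 entries describe the same work.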

  10. Richard Karnesky says:

    Hmm … I wonder if it might make sense to consider making use of the new info uri scheme?… How’s that?
    Yes–exactly. Sometimes I feel as if someone else already came up with (and even implemented) all of the best ideas already!

    The only novel element that should be added is a quick and easy way to add a new record to some sort of system. But this is somewhat difficult to also do inexpensively and authoritatively. I see mistakes in lists of a dozen references–who knows what would happen if everyone could add their own list.

    I’m starting to think citation keys and numeric identifiers should be separately coded in records.
    I agree completely. There is value in easily using numeric identifiers as URLs, and it is OK for them to be somewhat complicated if they are unique. But for anything you hand-hack (LaTeX, some XML formats, etc.), it is much more important to not confuse the author. Short, personal keys are a must.

  11. Bruce says:

    Erik, but what happens if you have three authors collaborating on the same piece, each using different id schemes? (Of course, this says nothing about the complex issues, which Rick notes, around the integrity of the metadata; what the library world refers to as authority data. Anyone looked at the new MADS schema the LoC is working on as a companion to MODS?)

    I think I want to figure out a way for my XSLT stylesheets to be flexible on this count. Am just not sure of the precise details yet (except I like the info scheme).

  12. Doug M-C says:

    The point about the info uri scheme is that the uri actually doesn’t reference anything.

    So the use of the info uri would be ONLY as an ID. Every data store then would be encouraged to include an info uri ID (or two or three).

    When I am then writing a paper, I include the info uri as the ID of the work being cited. It is then up to me to configure my software to search the data stores of metadata I already feel comfortable with when creating the footnote or the bibliography. I may include Bruce’s and the LoC but maybe no one else. Or I may include hundreds of personal metadata stores on the web and put up with the inconsistencies between them!

    Obviously, I would publish my own metadata store and everyone would be lining up to include mine in their accepted lists !!!!!
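
    The lookup described above can be sketched in Python as an ordered search through the stores an author trusts. Everything here is illustrative: the stores are plain in-memory dictionaries with placeholder records, whereas real stores would be remote services, and the name resolve is hypothetical.

```python
def resolve(info_uri, stores):
    """Return (store_name, record) from the first trusted store that knows the ID."""
    for store_name, records in stores:
        record = records.get(info_uri)
        if record is not None:
            return store_name, record
    return None  # no trusted store has metadata for this ID


# Placeholder metadata stores, listed in order of trust.
trusted_stores = [
    ("my-own-store", {"info:doi/10.1000.10/123456": {"title": "Placeholder title A"}}),
    ("some-library", {
        "info:doi/10.1000.10/123456": {"title": "Placeholder title A (variant)"},
        "info:lccn/n78089035": {"title": "Placeholder title B"},
    }),
]

print(resolve("info:doi/10.1000.10/123456", trusted_stores))
print(resolve("info:lccn/n78089035", trusted_stores))
print(resolve("info:doi/10.9999/unknown", trusted_stores))
```

    Putting stores in a fixed order is one simple way to live with inconsistencies between them: the first store that answers wins.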

  13. Peter Ring says:

    In the discussion, I believe it to be rather important to distinguish between the various forms and purposes of citation identifiers. A local shorthand ID referring through a local bibliography need not be a truly universal and persistent identifier. The information supplied to an OpenURL server need not be complete, and does not form a unique string. On the other hand, the DOIs that e.g. CrossRef is based on are true persistent identifiers. The FRBR identified four end-user tasks: to find, identify, select, and acquire an entity. We should not expect a single identifier scheme to fit all purposes.

    I’m mostly interested in citation identifiers for serials, specifically for the individual articles in e-journals, blogs, and the like. It’s a mess …

    You probably already know about

    SICI code (Serial Item and Contribution Identifier) http://www.niso.org/standards/standarddetail.cfm?stdid=530

    There’s a list of some common citation identifier encoding schemes at:

    Identifier Encoding Schemes http://epub.mimas.ac.uk/DC/citids.html

    Another overview:

    Using Existing Bibliographic Identifiers as Uniform Resource Names http://www.ietf.org/rfc/rfc2288.txt

    And yet another overview:

    Identifiers and Identification Systems An Informational Look at Policies and Roles from a Library Perspective http://www.dlib.org/dlib/january04/vitiello/01vitiello.html

    Another seminal paper from D-Lib:

    Reference Linking for Journal Articles http://www.dlib.org/dlib/july99/caplan/07caplan.html

    A discussion about various identifier schemes and examples of application to a variety of resources:

    NLA Guidelines for the Development and Application of a Persistent Identifier Scheme for Digital Resources http://www.nla.gov.au/initiatives/persistence/PIappendix1.html

    Some more references about identifiers:

    Unique Identifiers: a brief introduction http://www.bic.org.uk/uniquid.html

    Unique Identifiers in a Digital World http://www.ariadne.ac.uk/issue8/unique-identifiers/

    The DCMI Citation Working Group, that is, mainly Ann Apps, is preparing guidelines for capturing bibliographic citation information within a Dublin Core description:

    http://epub.mimas.ac.uk/DC/

    The ‘Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata’ has been in gestation for a very long time, and there might be some lessons to be learned. Eventually, the OpenURL framework was chosen as a suitable reference. OpenURL is apparently also considered by a NISO working group for article citations, NISO TG3/SG3.

  14. Bruce D'Arcus says:

    Peter sent these comments by email because comments were closed. I’m reopening them to register them here. The following message is his response to my question of what to do about the problem.

  15. Peter Ring says:

    If you want to be able to reliably exchange bibliographic information on a web scale, there’s no silver bullet. In effect, you want to do something similar to what WorldCat [1] does wrt. creating a ‘market’, a place for pooling the cataloging effort. At least, study their guidelines, e.g. ‘When to input a new record’ [2], to get an idea of what to expect.

    Prepare to:

    • take existing sources of bibliographic records into account, or you won’t get critical mass
    • handle gracefully a number of different identifiers and identification schemes
    • devise a way to credit (or discredit!) the source of a bibliographic record, and a way of rating bibliographic records

    On the web scale, I see the following trends:

    • FRBR is the common frame of reference
    • MARC or some updated variant of MARC, such as MODS, is the storage format
    • OpenURL is becoming popular as a retrieval interface,
    • while Z39.50 is still alive and kicking
    • some variant of a Handle [3] is used for global identification of resources
    • but there are also large national URN-based initiatives

    You can’t, as you’ve noticed, rely on a DOI being available. CrossRef has already registered more than 14 million DOIs, but there are more than 55 million records in WorldCat’s holdings, growing by more than 2 million records a year. One every 10 seconds [4].

    While DOI is being adopted on a large scale, e.g. by TSO in the UK for public documents [5], several national libraries still don’t fancy DOI because they perceive it primarily as a device for IP protection (which I think is not the case). A number of national systems are being developed. There was a seminar on persistent identifiers last June that provides a good overview [6]. For a good example of a national scheme, see Epicur [7].

    WorldCat is somewhat open for search and retrieval [8], as are many national online catalogs, but don’t expect bibliographic records to be freely available for redistribution; they represent a considerable value.

    Within specific fields, there’s often already someone who has collected a bibliography within the field, e.g. the Computer Science Bibliography [9]. These records might be more freely available for redistribution.

    [1] http://www.oclc.org/worldcat/default.htm
    [2] http://www.oclc.org/bibformats/en/input/
    [3] http://www.handle.net/
    [4] http://www.oclc.org/worldcat/grow.htm
    [5] http://www.tsoid.co.uk/
    [6] http://www.ariadne.ac.uk/issue40/erpanet-ids-rpt/
    [7] http://www.persistent-identifier.de/?lang=en
    [8] http://www.oclc.org/worldcat/open/default.htm
    [9] http://www.informatik.uni-trier.de/~ley/db/index.html

  16. Bruce D'Arcus says:

    I’d add to Peter’s list of trends SRU/W. Z39.50 may be “alive and kicking” but it won’t be for long once SRU/W takes off.


Creative Commons License