New Projects
Two new open source bibliographic projects of note. The first is called TEI XML Bibliography Project, and comes from Paul Tremblay. Paul was originally aiming for something more ambitious (e.g. a new bibliographic schema), but I think our conversations convinced him it’d be better to focus on something less grand.
I’ve tried to convince him of the importance of BiblioX to bibliographic formatting in XML, but obviously failed. There are thousands of bibliographic styles in circulation, and it seems hopelessly unworkable to write individual XSLT stylesheets for each and every one of them, for each and every output format. While BiblioX still needs a lot of work, the basic principle of a document- and bibliographic-data-agnostic XSLT-based formatting system is sound.
I also stumbled on yet another new project called Bibliophile. This one is a little different in that it is not a standalone software project, but rather an effort at standardization.
Bibliophile is an initiative to align the development of bibliographic databases for the web. It aims to promote standards, discussion among users on necessary features and a variety of specific solutions for different fields of research.
I like this. It gets even more interesting when they say they’re standardizing on MODS.
However, it’s also a little disheartening to see so many projects (usually based on PHP and MySQL) reinventing the wheel over and over again, and often not very well. In particular, it’s one thing to support MODS for data exchange, but these projects really need rich internal data models up to the task of representing MODS data. And yet, they all seem to start and end with the BibTeX data model. Please, people, understand that BibTeX has a very limited, totally flat data model that is not at all sufficient for scholars outside of the hard sciences, and which takes zero advantage of the power of relational databases and (even more) XML.
So where to look instead for inspiration? RefDB already has a better data model than BibTeX, and is currently being revamped with a richer MODS-compatible data model (see here for some of the code). And someone is working on a PHP-based interface, which will likely ultimately be based on a PHP module similar to the current RefDB Perl module.
For something more radical, how about LibDB? LibDB is written by Perl hacker Morbus Iff, and is based on principles in the FRBR (pdf). Its SQL schema has separate tables for works, for people and their roles, for events, etc., and Morbus is seriously considering revamping it as a plug-in for the open source CMS system Drupal. While begun as a project to store videos, it is designed to store any bibliographic metadata (save, yet, for what librarians call “analyticals” – articles, book chapters, etc.).
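To make the "separate tables" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not LibDB's actual schema: the table and column names below are assumptions chosen for illustration, and the sample rows are invented.

```python
import sqlite3

# Hypothetical three-table layout: works, people, and the roles linking them.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE works  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE people (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE roles  (
    work_id   INTEGER NOT NULL REFERENCES works(id),
    person_id INTEGER NOT NULL REFERENCES people(id),
    role      TEXT NOT NULL
);
INSERT INTO works  VALUES (1, 'Sample Film'), (2, 'Sample Book');
INSERT INTO people VALUES (1, 'Jane Roe');
INSERT INTO roles  VALUES (1, 1, 'director'), (2, 1, 'author');
""")


def works_by(name):
    """Return (title, role) pairs for one person, whatever role they played."""
    rows = db.execute(
        "SELECT w.title, r.role FROM works w "
        "JOIN roles r ON r.work_id = w.id "
        "JOIN people p ON p.id = r.person_id "
        "WHERE p.name = ? ORDER BY w.id",
        (name,),
    )
    return rows.fetchall()


print(works_by("Jane Roe"))
```

The point of the design is that a person is stored once and can be a director on one work and an author on another, something a flat record with a single "author" field cannot express.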
update: David Wilson just pointed me to another potentially-interesting-but-flawed project called B3 that bases its data model on BibTeX.
For what it’s worth, I’ve started work on the plugin (a “module”) to Drupal today (the CVS will be at http://cvs.drupal.org/viewcvs/contributions/modules/libdb/). It’ll be based off the CVS version of Drupal (which will become 4.5 sometime later this year), and thus, will need a decent amount of instructions as well as reasoning (”why did you decide to extend Drupal as opposed to stay fresh?”). So, although the database schema that darcusb points to is still 100% valid (and will live in Drupal nearly unmodified), the move-to-Drupal is still heavily in progress.
Hi,
Personally I think Bibtex has a lot more life in it. Why?
1) Because so many people actually use it. 2) Because greatly multiplying the ‘types’ of entries while keeping the basic bibtex structure is a snap. 3) Because generating valid MODS-formatted entries from entries stored in Bibtex-based databases is not that hard.
We’re at a point right now that there are no easily accessible open-source software packages that use a different standard. So we’re in a warring-states period where all kinds of standards end up competing. There is no way to determine which one will win, and often the least elegant/efficient solution (qwerty?) wins out.
This is why I think bibliophile is right on track. I agree with them that it’s crucial to provide first for inter-platform interchangeability. It’s a real pain to completely revamp database structures, but not that hard to code for importing/exporting from/to MODS.
But once this sort of standard is established, you can almost hear the developers saying: gee whiz, why don’t we just store the data that way! But it will take the widespread adoption of a particular standard before that eureka moment happens.
Then, in the future, if one standard is widely felt to be superior, the infrastructure to facilitate the rapid and widespread conversion to that standard will be in place, and we’ll be in bibliographic nirvana.
And what’s the importance of how the data is stored in the database? As long as I can get all the data in and out properly, what difference does it make?
I also don’t understand how Bibtex hampers the powers of relational databases. Authors/host titles don’t have to be in separate tables for me to do lots of fancy joins and groups to the database.
jl
I like a strong argument! OK, why does BibTeX suck?
1) unlike DC and MODS, it is fundamentally based on the vague and inconsistent concept of “type.”
2) it has no hierarchy at all.
The two of these together are expressed in the extreme awkwardness of fields like “chapter,” which has the problem that it is a) horribly concrete (what if you want to indicate a web page, which is structurally the same?), and b) in fact represents the chapter title. Because BibTeX is flat, you cannot conceptually have:
3) You can indeed extend this broken model, but that means likely breaking portability (since everyone will use different methods to represent that extended data).
Rather than speak in abstractions, here’s a couple things BibTeX cannot handle:
If you can reliably import and export the following — and also any other “types” that people want to store without modification — then that’s great, but I assert you cannot do it basing the model on BibTeX.
I agree with you that using MODS for exchange is a good first step. That’s no substitute for the next step though.
As for the difficulty in updating to a new data model, surely it’s a help that two other projects have already done a fair amount of work on this.
Hi Bruce!
First, what’s wrong with types? This is additional information about the item, no? We want to know if it’s a book or an article, no? The problem with the original Bibtex was that the range of types was way too restrictive. If that’s taken away, and types can be created at whim to suit user needs, then that problem goes away, no?
Putting hierarchy into an improved version of Bibtex isn’t that hard. Indeed the editor and booktitle fields I think were a first attempt. They could be made general ‘host creator’ and ‘host title’ fields (I should have a working example of how this can be done in a few days), thus establishing the relationship between part and container.
In addition, I’m pretty confident that an expanded Bibtex could account for about 99.99% of entries that people want to plunk in, including all of the types you mentioned above and many many more. Coming up with a new type won’t be that hard, and at most would involve some field mapping. This is where classes will shine. You just plunk in a bibtex->MODS converter class, for example, punch in an array of your bibtex entry, and you get back valid MODS. It will take some poor soul some work, but then no one else has to do it.
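As a rough sketch of the kind of converter class described here, the following Python function maps a flat BibTeX-style dictionary onto a nested, MODS-shaped tree. The element names (titleInfo, name, namePart, genre, relatedItem, originInfo, dateIssued) come from MODS, but the function name, the `type` and `host_title` keys, and the sample entry are assumptions, and the output is not validated against the MODS schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def bibtex_to_mods(entry: dict) -> ET.Element:
    """Map a flat BibTeX-style dict onto a nested MODS-like element tree."""
    mods = ET.Element("mods", {"xmlns": MODS_NS})

    title_info = ET.SubElement(mods, "titleInfo")
    ET.SubElement(title_info, "title").text = entry["title"]

    # BibTeX separates multiple authors with the literal word "and".
    for author in entry.get("author", "").split(" and "):
        if author.strip():
            name = ET.SubElement(mods, "name", {"type": "personal"})
            ET.SubElement(name, "namePart").text = author.strip()

    # The entry type survives as a genre value, not as the record's structure.
    ET.SubElement(mods, "genre").text = entry.get("type", "unknown")

    # Hypothetical generic "host_title" field, falling back to booktitle/journal,
    # becomes a nested container record.
    host_title = (entry.get("host_title") or entry.get("booktitle")
                  or entry.get("journal"))
    if host_title:
        host = ET.SubElement(mods, "relatedItem", {"type": "host"})
        host_info = ET.SubElement(host, "titleInfo")
        ET.SubElement(host_info, "title").text = host_title

    if "year" in entry:
        origin = ET.SubElement(mods, "originInfo")
        ET.SubElement(origin, "dateIssued").text = str(entry["year"])
    return mods


sample = {
    "type": "incollection",
    "author": "Doe, John and Roe, Jane",
    "title": "A Sample Chapter",
    "booktitle": "A Sample Collection",
    "year": "1999",
}
print(ET.tostring(bibtex_to_mods(sample), encoding="unicode"))
```

Note that the conversion only goes as far as the flat input allows: the host record gets a title, but anything the BibTeX entry never stored (host creators with roles, for instance) cannot be recovered on the way out.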
I agree with you that the ideal would be to have everyone magically agree on one standard, and use that for both database and queries. But as I said before, I don’t think we’re there yet. I’m not there yet. It’s just not worth it to me right now to completely revamp everything. It would take me months of hard work, where I can get a system up and running very soon without sacrificing the exchange of items.
None of this is straight Bibtex, but as you say, why reinvent the wheel?
jl
First, what’s wrong with types?
There’s nothing wrong with including that information. MODS does so with its “genre” element. The problem is a) the type list is fixed in bibtex, and b) it is structurally dominant.
In BibTeX and Endnote, you first have to say “OK, what type of record do I have?” The problem is, what if your record doesn’t fit the model?
In MODS, by contrast, you worry instead about the stuff that really matters (titles, creators, genre, medium, etc.) and you can specify whatever genre you want (preferably drawn from a controlled list, of course).
Right, I realize these are the main limitations of the current version of Bibtex.
–The problem is, what if your record doesn’t fit the model?
You create a new type. And if you can’t/don’t want to do that, you put it in ‘unknown’, and make sure that the genre is included somewhere in the entry, so that no information is lost if the item is exchanged.
jlassen: knowing little of BibTeX, can it approximate the FRBR model of Work, Expression, Manifestation, and Item? That’s what the LibDB data model attempts (along with Group 2 and 3 entities, and some more sugar).
Hi Morbus Iff:
Not as far as I know. BibTeX stores items in a single table with lots of fields - much like an expanded libdb_work table I imagine.
Yeah, BibTeX is far from the rigorous hierarchical abstractions of the FRBR. It is the most flat/simple bibliographic format that I know of.
Note, though, that I’m still not myself convinced that the full-blown FRBR distinctions are needed for this sort of (scholarly) stuff. I’m hoping/betting you can show the practical value of it, Morbus, so that the power of the FRBR is obvious when needed, but otherwise dissolves into the background.
A significant problem with the “improved BibTeX” you mention, Jonathan, is that, for BibTeX purists, really fixing it would effectively mean it was no longer BibTeX.
And there’s a lot that simply can’t be fixed. For example, I doubt you could code the same title in two languages, each tagged with that language, and also to indicate which was the primary title and which was the translation. (A guy on the RefDB list recently explained he needed just this; he’s a humanities scholar doing work in English and Japanese.)
That’s easy to handle in XML with MODS, and needs support in the database tables.
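For illustration, here is a small Python sketch of how such a two-language title might be built as MODS-style XML. It assumes the `lang` attribute and the `type="translated"` value on titleInfo; the helper name is hypothetical, and the sample uses the romanized title of a well-known Japanese novel simply as test data.

```python
import xml.etree.ElementTree as ET


def bilingual_title(primary: str, primary_lang: str,
                    translated: str, translated_lang: str) -> ET.Element:
    """Build a record fragment with a primary title and a tagged translation."""
    mods = ET.Element("mods")

    # The primary title carries only its language code.
    main = ET.SubElement(mods, "titleInfo", {"lang": primary_lang})
    ET.SubElement(main, "title").text = primary

    # The translation is marked as such, with its own language code.
    trans = ET.SubElement(
        mods, "titleInfo", {"type": "translated", "lang": translated_lang}
    )
    ET.SubElement(trans, "title").text = translated
    return mods


record = bilingual_title("Wagahai wa Neko de Aru", "jpn", "I Am a Cat", "eng")
print(ET.tostring(record, encoding="unicode"))
```

Both titles live in the same kind of element, so a formatter can pick whichever one a style calls for without knowing anything about the record's type.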
Hi Bruce,
Yes, it will no longer be BibTeX, but an extension, backwards compatible.
For the translated item you mention above, two things will be possible (in a couple of days!):
1) The admin could add another field to the database (for example, trans_title), and make this an ‘included’ field for certain types (book, article, etc.) That means that the field will appear when adding items of those types, or when importing items of that type.
2) The admin could add an entirely new ‘type’ - translated book, for example, and populate it with whatever fields he/she wanted.
This may sound like a bit of a pain, but it won’t be. I’ve just put all of the fields and types into the mysql database, and it’s going to be a breeze not only to set up different types and fields, but also to create and tweak different display styles (chicago, AAG, whatever) for the various types of entries. This will happen via wiki-markup language.
The language tag is an interesting question. I think that will (for now) have to be one of the wiki-like tags in fields that are run when you include [[...]] in a field. For example, [[l:japanese]], or [[l:iso-8859-1]] at the beginning of the field could be used to indicate the language of the field.
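A minimal sketch of how a leading language tag of this kind could be pulled off a field value, written in Python. The function name and the exact pattern are assumptions; only the [[l:...]] form at the start of a field is taken from the description above.

```python
import re

# A tag such as [[l:japanese]] is only recognised at the very start of a field.
LANG_TAG = re.compile(r"^\[\[l:([^\]]+)\]\]\s*")


def split_language_tag(value: str):
    """Return (language, text); language is None when no leading tag is present."""
    match = LANG_TAG.match(value)
    if match is None:
        return None, value
    return match.group(1), value[match.end():]


print(split_language_tag("[[l:japanese]]Kokoro"))
print(split_language_tag("An untagged title"))
```

One limitation worth noting: with the tag stored inside the field, every query or export that touches the field has to strip it first, whereas a separate language column or attribute would not need that step.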
jl
Very interesting discussion. Thanks for the comments about bibliophile. Re. your charge that we’re reinventing the wheel by basing our database/structural models on BibTeX, I should say in my defense (as developer for wikindx) that until about 4 months ago (about 3 months into developing wikindx) I’d never heard of BibTeX and I certainly did not base wikindx database structures on BibTeX. I only added import/export of it in because someone asked me to at sourceforge. I also think BibTeX is flawed and many of the types of resource I deal with (in the arts) are not or cannot be represented by bibtex. But lots of people use it so I added support for it in.
I based wikindx around the types of resources I would be interested in - it doesn’t yet handle all types I want (TV, games, films etc.) but will eventually.
Enough of the plug….
Bruce, it seems to me that this comment: “In MODS, by contrast, you worry instead about the stuff that really matters (titles, creators, genre, medium, etc.) and you can specify whatever genre you want (preferably drawn from a controlled list, of course).”
with phrases such as controlled list, specifying a genre etc., is really no different from specifying a bibtex type. They’re both controls with the structural advantages that gives and the limitations that implies. You’re simply substituting one set of rules for another.
Also, no-one has yet mentioned the need to convert bibliographies to different bibliographic styles (MHRA, APA etc.). It’s an unfortunate fact of life that a proliferation of such styles exists (every night before I go to sleep I rain down a thousand curses on the heads of the originators). However, a programmer designing conversions from bibliographic databases for such styles (as wikindx does) HAS to know what type a particular resource is as the presentation of the resource entry for a particular style very often depends directly on the type of resource. A journal article is displayed quite differently to a newspaper article, to a chapter in a book, to an article on the web or to a proceedings article.
A programmer’s life is made much simpler (and the resulting code more efficient) if we don’t have to go hunting around or attempting to guess what type of resource something is based on some database field that may or may not be there (trying to figure out whether a BibTeX resource is actually a web article or not is a case in point - some people use the non-standard field URL while others use @misc with a howpublished field, still others attempt to indicate in the note field that this is a web resource).
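To show what that guessing looks like in practice, here is a hedged Python sketch covering the three conventions just listed. The function name, the `type` key holding the entry type, and the "http"/"www." tests are all assumptions; it is a heuristic, not a reliable classifier.

```python
def looks_like_web_resource(entry: dict) -> bool:
    """Guess whether a BibTeX-style entry describes a web resource."""
    # Field names are case-insensitive in practice, so normalise them first.
    fields = {key.lower(): str(value) for key, value in entry.items()}

    # Convention 1: the non-standard url field is present and non-empty.
    if fields.get("url"):
        return True

    # Convention 2: @misc with a howpublished field that holds an address.
    if (fields.get("type", "").lower() == "misc"
            and "http" in fields.get("howpublished", "").lower()):
        return True

    # Convention 3: an address tucked into the free-text note field.
    note = fields.get("note", "").lower()
    return "http" in note or "www." in note


print(looks_like_web_resource(
    {"type": "misc", "howpublished": "\\url{http://example.org/paper}"}
))
```

Every branch is a guess about someone's private convention, which is exactly the fragility being complained about here.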
Mark,
with phrases such as controlled list, specifying a genre etc., is really no different from specifying a bibtex type. They’re both controls with the structural advantages that gives and the limitations that implies. You’re simply substituting one set of rules for another.
I see it very differently. BibTeX has something like 13 types. The MARC genre term list, by contrast, contains about 50. It’s far easier to change a list of values than to change the data model. If you add, say, a type of “legal case” to bibtex, you must add the accompanying field of casetitle, etc., etc.
Also, no-one has yet mentioned the need to convert bibliographies to different bibliographic styles (MHRA, APA etc.).
If you read through my blog, you’ll see that has been my primary obsession. BiblioX was in fact my idea, and it was all about figuring out how to format MODS records without relying on bibtex-esque typing (though Peter Flynn has managed to code it differently than I intended). And RefDB has pioneered formatting in XML. I suggest you take a look.
However, a programmer designing conversions from bibliographic databases for such styles (as wikindx does) HAS to know what type a particular resource is as the presentation of the resource entry for a particular style very often depends directly on the type of resource. A journal article is displayed quite differently to a newspaper article, to a chapter in a book, to an article on the web or to a proceedings article.
I’m afraid I have to disagree with you. I’ve written an XSLT stylesheet that pretty well proves that one can successfully format most records without considering type/genre. I’ll post something about it later.
It would be interesting to see the XSLT stylesheet. I notice the BiblioX example available here: http://www.silmaril.ie/bibliox/biblioxdoc.html still requires a type (e.g. ).
…e.g. reftype class=”book” aftersep=”.”
http://www.silmaril.ie/bibliox/biblioxdoc.html still requires a type (e.g. ).
Yes, and this is my big objection. I want to see a higher-level structural abstraction and then mandate generic definitions within them. Genre-specific definitions would then override them. That gives the best of both worlds.
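One way to picture generic definitions with genre-specific overrides is the following Python sketch. It is not BiblioX syntax: the structural classes, the style keys, and the override table are all hypothetical names invented for the example.

```python
# Generic formatting rules, keyed by structural class rather than by type.
GENERIC_BY_STRUCTURE = {
    "monograph": {"title_style": "italic", "creator_sep": ", ", "after": "."},
    "part": {"title_style": "quoted", "creator_sep": ", ", "after": "."},
}

# Only genres that really differ need an entry, and only for what differs.
GENRE_OVERRIDES = {
    "thesis": {"title_style": "plain"},
}


def resolve_style(structure, genre=None):
    """Start from the generic rules, then apply any genre-specific overrides."""
    style = dict(GENERIC_BY_STRUCTURE[structure])
    style.update(GENRE_OVERRIDES.get(genre, {}))
    return style


print(resolve_style("part", "article"))
print(resolve_style("monograph", "thesis"))
```

A genre nobody anticipated still formats sensibly, because it falls through to the generic rules for its structure.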
No offense but I’d like to take issue with your swingeing attack on the projects at bibliophile. I can’t speak for the other projects but the one I’m involved in has a database structure that is not in the least based on bibtex.
In a sense, it’s similar to what you wrote about LibDB (”Its SQL schema has separate tables for works, for people and their roles, for events, etc.”) but not so all-encompassing. At first glance at the LibDB schema it doesn’t seem to have the facility (not a criticism merely an observation) for quotes, thoughts, paraphrases that wikindx has.
I think it’s a mistake to view all bibliographic databases as performing only one and the same function. My intention with wikindx is to create a shared research space that happens to be based on a bibliographic database but also allows participants’ ideas, selection of interesting quotes and comments on them etc. to be perused and themselves commented on. The free sharing of knowledge, not just a list of dusty tomes.
Mark — I was a little harsh in the post (sorry!), but it wasn’t really aimed at you and your project. It reflected a general frustration of mine at how many open source bib projects are out there, and very few seem to cooperate.
For example, you could have based yours on RefDB, which has annotation support built in.
I actually embed my annotations in the mods extension element. I’ll send an example to you.
It would be a great step forward if we could get BiblioX, or at least its style spec language, in shape to be usable in projects like yours.
Concerning “quotes, thoughts, and paraphrases”, LibDB does support this stuff, and far more. The “annotations” and “annotations_types” tables allow the user to define as many types of annotations as they want - the default SQL offers “Tagline”, “Review”, and a few others. As a user is adding or editing a bib record, they’d be able to assign as many annotations, of any type, as they’d like. Annotations can also be keyed to a user record, so that annotations can be “owned” - people who regularly enter false bib data (or just naughty diatribe) could be restricted from annotating again, etc. (though, this functionality has not been implemented or thought 100% through).
In addition, annotations can be associated to ANY other record in LibDB, not just bib entries. It’d be possible in the current data structure for someone to “annotate” a “person”, “corporate body”, “concept” or even another annotation (such as a correction or clarification, etc.).
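A rough sketch of that idea in Python with sqlite3 follows. The two table names are the ones mentioned above, but every column name, the `annotate` helper, and the sample rows are assumptions rather than LibDB's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotations_types (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE annotations (
    id           INTEGER PRIMARY KEY,
    type_id      INTEGER NOT NULL REFERENCES annotations_types(id),
    target_table TEXT NOT NULL,   -- e.g. 'work', 'person', 'annotation'
    target_id    INTEGER NOT NULL,
    user_id      INTEGER,         -- optional owner of the annotation
    body         TEXT NOT NULL
);
""")


def annotate(type_name, target_table, target_id, body, user_id=None):
    """Attach an annotation of any user-defined type to any kind of record."""
    conn.execute(
        "INSERT OR IGNORE INTO annotations_types (name) VALUES (?)", (type_name,)
    )
    type_id = conn.execute(
        "SELECT id FROM annotations_types WHERE name = ?", (type_name,)
    ).fetchone()[0]
    cursor = conn.execute(
        "INSERT INTO annotations (type_id, target_table, target_id, user_id, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (type_id, target_table, target_id, user_id, body),
    )
    return cursor.lastrowid


review = annotate("Review", "work", 1, "A sample review.", user_id=7)
# An annotation can itself be annotated, for instance with a correction.
annotate("Correction", "annotation", review, "A sample correction.")
```

The (target_table, target_id) pair is what lets one table serve works, people, concepts, and other annotations alike, at the cost of the database no longer being able to enforce that the target actually exists.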
Being able to attach annotations to whatever metadata is indeed cool.
However, I actually don’t like wikindx’s separation of quotes, from “musings,” from “paraphrases,” from “random paraphrases.” I don’t work this way. I want a single annotation type in which I can include whatever semantic markup (quotes, highlight phrases, citations, links, etc.) that I want and have it converted to XML. This is exactly what wiki markups are ideal for, and Jonathan is proving it actually works!
Well, a ‘random paraphrase’ is not a type of annotation - it’s simply a (fun) way of presenting a bit of knowledge that someone suggested would be easy to implement and, in large databases, might throw up a piece of information that a user might never normally see in a flat view.
I’m all for co-operation but anti monolithic monopolistic systems. Sometimes re-inventing the wheel is a worthwhile process. You mention that you don’t work ‘this way’ re. wikindx’s separation of quotes, paraphrases and musings. I do and I know several other people who also do. If you were to design one system that only did things one way then you cut that group of people out. There’s strength and opportunity in variety as Darwin was well aware. It may be that a future version of wikindx will give users/admins the choice over which method they wish to use or perhaps find some way of combining them.
No, Bibliophile is not about creating a standard bibliographic database/system. Among some of us there was such a tentative discussion but it was dropped when we recognised that our different projects had quite different aims and uses and the ‘one size fits all’ model would not suit all our research practices. What bibliophile is about is generally discussing common solutions to common problems and, right now and specifically, coming up with a system to get the wonderful variety of bibliographic databases out there talking to each other for cross-database searches.
Morbus, Apologies, my post should have read ‘At first glance because I don’t have time to look into it fully now but promise to later…’.
Can I suggest you both join the Bibliophile project (even if only its mailing list where we’re deep in the throes of a discussion on servers, registration of databases and xml-rpc)?
I beg to differ: a ‘random paraphrase’ /is/ an annotation, in the sense that I take the most argument-favoring definition from dictionary.com (”A note, added by way of comment, or explanation; — usually in the plural; as, annotations on ancient authors, or on a word or a passage.”).
Personally, I myself don’t work like Bruce, but that doesn’t mean he couldn’t use LibDB’s data model. If he wanted to go and make an annotation named “Potpourri” and dump all his crap in there, so be it. I, on the other hand, would love annotations defined as “Review”, “Historical Fact”, and “Musings” (“i wonder if this book could illuminate further the thoughts in BibRecord #314?”). My “Potpourri” annotation, in this case, would be stuff that didn’t fit into my more finite clarifications previously defined (for example, “Death by axe to head”). If, however, I find myself annotating “Death By” for all my horror movies, I’d be able to go and create a “Death By” annotation, thus giving me searching and categorizing capabilities for that type of data (“show me all deaths caused by an axe between the years of 1980 and 1990”).
Joined the mailing list.
I should clarify on annotations: I have no problem allowing people to categorize their annotations how they like. For me, it’s totally artificial to store quotes separate from commentary related to those quotes.
I use my annotations as the beginnings of publishable content, which I then integrate into my manuscripts. All of this is done with XML for me. See some of Jon Udell’s or Kimbro Staken’s stuff on micro-content. They’re both using XML DBs though, where you can just do an xpath match that says “give me all paragraphs that contain a quote from John Doe.”
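As a small illustration of that kind of match, here is a Python sketch using the standard library's ElementTree, which supports only a subset of XPath. The `notes`, `p`, and `q` element names, the `cite` attribute, and the sample text are all invented for the example; in a full XPath engine the same query could be written as //p[.//q/@cite='DoeJ1999a'].

```python
import xml.etree.ElementTree as ET

# Invented sample annotations; quotes carry the citation key of their source.
NOTES = """
<notes>
  <p>Doe argues that <q cite="DoeJ1999a">a sample quotation</q>, which I doubt.</p>
  <p>A paragraph with no quotation at all.</p>
  <p>Compare <q cite="SmithA2001">another sample quotation</q> on this point.</p>
</notes>
"""


def paragraphs_quoting(root, cite_key):
    """Return every paragraph that contains a quote citing the given key."""
    return [
        p for p in root.iter("p")
        if p.find(f".//q[@cite='{cite_key}']") is not None
    ]


notes_root = ET.fromstring(NOTES)
for paragraph in paragraphs_quoting(notes_root, "DoeJ1999a"):
    print("".join(paragraph.itertext()))
```

Because the quote and the commentary sit in the same paragraph, the match returns both together, which is the argument for not storing them separately.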
Bruce: re. the use of wiki markup language and being able to annotate comments, quotes etc. with markup language. There’s a whole class of (potential) users of such bibliographic databases who would shy away from such a complication while, agreed, there’s another class that would take to it like a duck to water. I can foresee wikindx and/or systems perhaps offering a ‘power-user’ mode where such a facility is available. But again, one size does not fit all.
Bruce: re. separation of comments from quotes. wikindx maintains comments with quotes and paraphrases such that when you see a quote or paraphrase, the comment is always displayed too. Musings are a separate item and represent thoughts about the resource (or part of it) and hence are not tied to any quote or paraphrase. It’s along the ‘I wonder if..’ lines that Morbus mentioned earlier or ‘it might be worthwhile comparing what Dr. X is saying in chapter 12 to what Freud has to say on the oedipus complex…’.
And certainly the types of searching you mention (although not (yet) that specific one) can be done if not (yet) in natural language.
Morbus’ use of annotations is similar to my use of keywords (user-creatable) and the larger groups (admin-creatable) so a wikindx user might also create a ‘death by’ keyword attached to all relevant resources.
Interesting discussion on comments. I hadn’t thought about this at all. The most intriguing for me is allowing users to type their own comments - musings, annotation, quote, rant, reply, etc… This could make it a lot easier for users to navigate the meta-data on some work if there’s a lot of the stuff.
Hmmmm.
I much prefer your approach Jonathan, where annotations are just like weblog posts (following that, it might be nice to allow keywords be attached to them).
I still think separating quotes from “musings” is artificial. For example, in a review it’s perfectly common to include quotes from a work.
If one is concerned with storing metadata about quotes such as page numbers (as I am), just put it in the wiki/xml markup (e.g. “A quote[[c:doe99@22]]”).
Yeah, I think it would be useful to have an admin-defined select field added to the comments:
you’d get a drop down menu with things like:
etc..
I’m not sure if an open field of keywords would be all that useful, but that would be easy to add.
hmmm.
In the example above, do you mean to put “A quote[[c:doe99@22]]” into the message?
(see: http://www.chinastudygroup.lunarpages.com/biblioshare/index.php?action=biblio&type=view&id=230) for an example.
Hmm. Perhaps some special wiki markup language might be helpful in the comments, like [[p:[/cbyat/]45]]. Since we already know what item you’re referring to, there’s no need to type in the wiki_title again. The p would signal that it’s a page number reference. I guess it’s just a little change.
Yeah, but the example would be rendered like “blah blah blah” (Wang, 2003: 45).
As for the second bit — the redundancy — why not just “A quote[[@22]]” or “A quote[[c:@22]]”? It’s more consistent with the other markup (and also familiar to Endnote users, BTW).
Right. Should have it up by tomorrow.
I’ve been spending some time over the last few weeks figuring out how to have cross-referencing etc. in wikindx. I’d been edging towards some kind of mark-up language that refers to the resource possibly with the page no. wikindx uses integers as unique identifiers for its resources (easier to generate and handle in RDBMS) so I would favour quote[[121@22]] meaning resource no. 121 page 22. If it is quote[[@22-23]], it means this resource that we’re already dealing with, and a quote spanning pages 22-23. A potential problem is a bulky interface that somehow has to provide the id (or any other reference) and the title for all resources for the user to select from.
I think instead of using ID numbers, it’s easier for users to use a citation key like Jonathan is doing (and as is done in bibtex). An example might be DoeJ1999a. I use that when I cite in XML, BTW, and it is the ID for the MODS records. Otherwise, I agree with your thinking.
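The markup variants floated in this exchange could all be read by one small parser. Here is a sketch in Python; the `PageRef` tuple, the function name, and the exact pattern are assumptions, and it accepts either an integer ID or a citation key before the @ sign, with an empty reference meaning "the resource being annotated".

```python
import re
from typing import NamedTuple, Optional


class PageRef(NamedTuple):
    resource: Optional[str]  # None means the resource currently being annotated
    first_page: int
    last_page: int


# Accepts [[c:doe99@22]], [[c:@22]], [[@22]], [[121@22]] and [[@22-23]].
CITE = re.compile(
    r"\[\[(?:c:)?(?P<ref>[^@\]\s]*)@(?P<first>\d+)(?:-(?P<last>\d+))?\]\]"
)


def find_page_refs(text):
    """Extract every page reference embedded in a comment or quote."""
    refs = []
    for match in CITE.finditer(text):
        first = int(match.group("first"))
        last = int(match.group("last") or first)
        refs.append(PageRef(match.group("ref") or None, first, last))
    return refs


print(find_page_refs("A quote[[c:doe99@22]] and another[[@22-23]]"))
```

Keeping the resource part as an opaque string means the same markup works whether a project identifies records by number or by key; resolving it to a record is left to the database.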
Mark,
Or you could just add a field to the ‘resource’ table, and index it. Then just make sure that a unique alpha-numeric key is inserted with each entry (like DoeJ1999a). Then you can basically use that field as a unique id.
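A minimal sketch of that suggestion, using Python's sqlite3 as a stand-in for MySQL. The `resource` table's columns, the `cite_key` column name, and the key-building rule (family name, given initial, year, then a letter) are assumptions modelled on the DoeJ1999a example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for an existing table that identifies resources by integer only.
conn.execute(
    "CREATE TABLE resource (id INTEGER PRIMARY KEY, author_family TEXT, "
    "author_given TEXT, year INTEGER, title TEXT)"
)

# The change itself: one extra column plus a unique index on it.
conn.execute("ALTER TABLE resource ADD COLUMN cite_key TEXT")
conn.execute("CREATE UNIQUE INDEX idx_resource_cite_key ON resource (cite_key)")


def make_cite_key(family, given, year):
    """Build a key such as DoeJ1999a, advancing the letter until it is unused."""
    base = f"{family}{given[:1].upper()}{year}"
    for suffix in "abcdefghijklmnopqrstuvwxyz":
        candidate = base + suffix
        taken = conn.execute(
            "SELECT 1 FROM resource WHERE cite_key = ?", (candidate,)
        ).fetchone()
        if taken is None:
            return candidate
    raise ValueError(f"no free citation key for {base}")


def add_resource(family, given, year, title):
    """Insert a resource and return the citation key assigned to it."""
    key = make_cite_key(family, given, year)
    conn.execute(
        "INSERT INTO resource (author_family, author_given, year, title, cite_key) "
        "VALUES (?, ?, ?, ?, ?)",
        (family, given, year, title, key),
    )
    return key


print(add_resource("Doe", "John", 1999, "A Sample Title"))
```

The integer ID stays as the primary key for joins, while the indexed key gives users something they can actually type.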
You write: “A potential problem is a bulky interface that somehow has to provide the id (or any other reference) and the title for all resources for the user to select from.”
I’m not sure what kind of interface you mean, can you elaborate?
Jonathan
Thinking of the monitor real estate much of which is taken up with the textarea input for the quote/comment whatever. The user can’t be expected to remember the correct citation (JDoe2002d) so has to be presented with a list from which to browse and select. This takes up space on the screen especially if you also wish to display the first author’s name too.
I’d imagine some javascript trickery might help here.
When I played with mockups, my idea was the commenting UI would work just like in a weblog: a window would pop-up. That would allow users to access a clean table view in the main window if needed.
Note, I think the wikindx table view could be improved so as to be more readable. Also, everything there should have links (for column-resorting, to access creator names, or the record itself through the title). Finally, I hate separate search interfaces (read Tim Bray’s stuff on search UI’s). Why not a search field on the same page as the table to filter it?
Worth thinking about especially the bit about column resorting. However, having a search interface on the same page as the bibliographic list either cuts down the amount of resources you can show or increases (to me annoying) vertical scrolling.
I’m just talking about a single, simple field, a la google. It need only take a few centimeters of vertical real estate.
See also this mockup, which eXist XML DB author successfully coded.
http://www.users.muohio.edu/darcusb/misc/biblio-2.html
The search panel there is always accessible, and the sidebar is all done with CSS.