XML and RDF

Last week at the Access 2005 conference, I told a room full of mostly library people that their XML standards (I was talking about MODS and MADS in particular) are needlessly complex, inflexible, and awkward; that they were not hacker-friendly. I showed them an alternative schema I’ve been working on that is better, cleaner and much more hacker-friendly XML. Modeled on DOAP, this schema also happens to be RDF, and I exploit the basic plumbing of RDF like its linking support (which I explained was much more consistent than the use of xlink in MODS) to yield a data representation that I think may well be close to a perfect balance of simplicity and expressiveness for citation-related metadata.

As I explained to the audience, before I left I got a comment from a Mac developer on the schema. He has been working on a MODS GUI editor interface, which I have always felt would be a difficult task without some pretty serious abstraction, and so I thought what I was working on might give him some ideas.

Now, this developer has no experience at all with RDF, nor much with XML. However, when he looked at an example instance, he immediately understood why the basic structure of RDF would be valuable. He didn’t use the word “triples”; he just recognized it made sense to have authors be full resources: Person objects that one links to. As he concluded, he also found it much more readable than MODS. Exactly what I was after in fact!

My point goes against the grain of common wisdom, which says that XML is easy, and RDF hard. I simply believe both statements are wrong. What, after all, is more basic than the notion of making statements about things as a list of triples, and using URIs and namespaces to disambiguate names? My point also goes against a lot of the discussion about RDF that seems to be getting quite heated. Only this time the heat is in the RDF community itself, as people argue about the future of the technology in the face of outside pushback that is mostly about the XML syntax.
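To make the "list of triples" idea concrete, here is a minimal sketch in plain Python. It is only an illustration, not my schema and not any RDF library: the `Triple` type, the `objects` helper, and everything under `http://example.org/` are made up for the example. The Dublin Core terms and FOAF namespace URIs are the real ones.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Real namespace URIs for Dublin Core terms and FOAF.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
# Hypothetical namespace, invented for this example.
EX = "http://example.org/"

# A citation record is just a list of statements.
statements = [
    Triple(EX + "book1", DCTERMS + "title", "An Example Book"),
    # The author is a full resource that the book links to, not a nested string.
    Triple(EX + "book1", DCTERMS + "creator", EX + "people/jane"),
    Triple(EX + "people/jane", FOAF + "name", "Jane Doe"),
    # Same local name "title", different namespace, so no clash with dcterms:title.
    Triple(EX + "people/jane", EX + "title", "Professor"),
]


def objects(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


# Follow the link from the book to its author, then read the author's name.
author = objects(statements, EX + "book1", DCTERMS + "creator")[0]
author_name = objects(statements, author, FOAF + "name")[0]
```

Because predicates are full URIs, the book's `title` and the person's `title` never collide, and the author can be shared by any number of records that link to the same URI.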

Let me say a few words on this as someone fairly new to RDF, with rather more experience with XML. Yes, the RDF/XML syntax needs work to make it more friendly to XML tools. If people like Edd Dumbill feel the need to rely on RELAX NG to constrain the syntax, then there’s something wrong (though I feel the need to do the same with the non-RDF MODS schema, so I’m not sure that’s such a big deal). I’m not one who believes, however, that one must throw the baby out with the bath water. Fixing a single problem in the spec (that one can use attributes to denote properties) would go a long way towards rationalizing it.

What else? I don’t understand the purpose of Alt and Bag, and bet I’m not alone. Likewise, while coming from an XML background leaves me predisposed to wanting to use reification, I tend to think it complicates the triples model without clear pay-off.

I’m also still not sure about the need for datatyping.

But here is the bigger picture that all manner of critics sometimes forget: RDF is trying to solve some hard and important problems. Metadata is hard, particularly in a distributed context, and I see nothing out there that offers a reasonable alternative.

Certainly the RDF world could look at simplifying the XML syntax, but I agree with Dan Brickley that an even more important goal is continuing the evolution of RDF tools. If hot application environments like Ruby on Rails had RDF support that mirrored its current SQL-based ActiveRecord, then that would do more to encourage uptake of RDF than anything done with standards documents.
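As a rough sketch of what I mean by ActiveRecord-style RDF support, here is a toy Python wrapper that exposes a resource's properties as attributes. Everything in it is an assumption made for illustration: the `Resource` class, the `prefix_local` attribute naming (for example `foaf_name`), and the `example.org` URIs are hypothetical, and this is not how Rails or any existing RDF library works. Only the FOAF namespace URI is real.

```python
FOAF = "http://xmlns.com/foaf/0.1/"


class Resource:
    """Hypothetical record-like view over one subject in a list of triples."""

    def __init__(self, uri, triples, prefixes):
        self._uri = uri
        self._triples = triples      # list of (subject, predicate, object) tuples
        self._prefixes = prefixes    # e.g. {"foaf": FOAF}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        # Attribute names take the form prefix_local, e.g. foaf_name.
        prefix, _, local = name.partition("_")
        if prefix not in self._prefixes or not local:
            raise AttributeError(name)
        predicate = self._prefixes[prefix] + local
        values = [o for s, p, o in self._triples
                  if s == self._uri and p == predicate]
        if not values:
            raise AttributeError(name)
        # A single value is returned bare; repeated properties come back as a list.
        return values[0] if len(values) == 1 else values


triples = [
    ("http://example.org/people/jane", FOAF + "name", "Jane Doe"),
    ("http://example.org/people/jane", FOAF + "nick", "jd"),
    ("http://example.org/people/jane", FOAF + "nick", "janey"),
]

person = Resource("http://example.org/people/jane", triples, {"foaf": FOAF})
name = person.foaf_name
nicks = person.foaf_nick
```

The point of the sketch is the division of labour: the vocabulary supplies the property names, so application code can work with something that looks like an ordinary object while the triples stay underneath.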

But let’s keep in mind my bigger point again: metadata is hard, and technology is not an ideal world of black & white options, but a messy one of grounded compromises. One can very easily do very bad XML (witness OPML), while writing very clean RDF/XML. And I believe designing an XML schema based on RDF would tend to lead to better XML design. Large companies like Adobe are putting RDF to practical use. Perhaps, then, we need less talk at the level of Platonic technology ideals, and more practical discussion of what problems we need to solve, and how to do just that?

11 Comments

  1. Bruce D'Arcus says:

    I realize after posting this my comparison of MODS and my schema may seem unfair. For the record, I know that the LoC had to deal with a lot of legacy issues in designing MODS that I simply don’t care about (nor should I). Also, MODS needs to carry a fair bit more information than does a citation-oriented schema. I still think they should have started with RDF, though, and only discarded it if there was an explicit need to do so.

  2. Bill de hOra says:

    “If hot application environments like Ruby Rails had RDF support that mirrored its current SQL-based ActiveRecord, then that will do more to encourage uptake of RDF than anything done with standards documents.”

    I’ve tried this for Django. It’s hard, and it might not be possible generally. That’s because RDF doesn’t have a type system to hang your mapping code off. The dirty secret of the new dynamically typed frameworks is that they are dependent on the type system of the database to shuttle data back and forth between the presentation and persistence layers. RDBMSes are providing very useful constraints to mapping tools like ActiveRecord that make the entire setup possible. You stop using that or Django Meta, and you can get in the weeds quickly.

    I think it could be done for RDF vocabularies and I think it makes more sense that way. What’s the point of reinventing a User class and all the corresponding stuff when you can optimize on FOAF? Provide the Objects for FOAF and then key the objects off that. IOW, to maximise RDF’s value you stop worrying about Data and Object structures, and focus on application logic and views. Definitely you will not get middleware and web programmers to work directly with triples; they need a domain model in code to reason about.

  3. Josh Berkus says:

    Bruce,

    Oh, I don’t know. Legacy development is often an excuse rather than a legitimate reason; that is, developers do it because it’s less work just to fix the worst problems with the old stuff than to do a real refactoring. “Oh, we can’t do that, it would break too many things.”

    Of course, with a government agency sometimes they’re not allowed to do a real refactoring.

    Regardless, if the specs you’ve sent me are any indication, the LoC is still plagued with IT managers who learned data management in the old mainframe network-database days and understand neither relational databases nor markup languages. For example, the typespec you sent me had “User Indicated Field 1-5”. That’s flatfile, pre-relational thinking there.

  4. Bruce D'Arcus says:

    Yes, Bill, that’s what I was thinking. Give Rails or Django an RDF schema and then be able to just do stuff like “print foaf.name” without hassle. So the schema would serve as the same kind of constraint that, as you note, the databases provide now.

    Was trying to be diplomatic Josh :-)

    There are some things in the library specs that are a function of legacy needs and quite deep-seated traditions. For example, cataloguing rules say that contributors are tied to items via “authorized names,” which are used to uniquely identify people and organizations. This is an awkward way to do things in the 21st century, and accounts largely for the weakest part of both MODS and MADS. The problem is those sorts of rules absolutely cripple the latter.

    Converting tens of millions of legacy records to a fundamentally different model is also no doubt expensive!

    That said, the name issue (which is important in bib data when you deal with transliterated names in particular) isn’t solved in the RDF world either, though it’s certainly been recognized in the FOAF world.

  5. Bruce D'Arcus says:

    Ahem, that’d probably be something like “print a.foaf.name”.

  6. Kool says:

    Hello,

    I am the “Mac developer” Bruce mentions in this post. I just wanted to make clear that my statement that I found RDF more readable than MODS was made in the context of a reference manager. I am not sure if RDF would be better for the library index for which MODS was specifically designed. One thing that does happen when one takes a first look at MODS is becoming overwhelmed with what seems like a very complicated structure of nesting for certain elements.

    The RDF approach resembles the object oriented programming structure I have become acquainted with using Cocoa on Mac OS X. By separating entities into “classes”, those that appear more than once in a file/system share the same entry. For a reference manager this seems to me like a very good approach. In a way it also should be good for usage in library index systems, after all, it makes finding e.g. all books of a certain author easier, but the disadvantage of spreading out an entry for an item over several entities might make it more difficult to manage.

    What are “triples” in the RDF context btw?

    I guess both RDF and MODS have their pros and cons; I can’t say which is the best. I can however say that RDF would be the format I’d prefer if I had to write a reference manager.

    Johan

  7. Bruce D'Arcus says:

    Johan — “triples” refers to the basic subject-predicate-object model of RDF. E.g. [johan] -> [respondedTo] -> [bruce]. In RDF, then, you make statements about things by creating lists of triples.

  8. Edd Dumbill says:

    Two quick points:

    I wrote a RELAX NG schema first so I could easily author instances in Emacs nxml mode, and second to enable some simple vetting for authors who didn’t want to understand RDF. I never really thought of it as going against RDF, just making it easier to use DOAP.

    Second, for binding, Jo Walsh has been working on something like this for Nodel.org. As I’m travelling right now, I don’t have the references to hand, but the basic deal is that it’s a layer on top of Redland/Py.

  9. Bruce D'Arcus says:

    Edd — I didn’t mean to suggest that you were “going against RDF”; just that it’s clear one reason DOAP has taken off is precisely because there is a constrained syntax that works for XML people and tools.

  10. [...] Recently at darcusblog there was a discussion on simple vs complex in the library data sets. To quote: Last week at the Access 2005 conference, I told a room full of mostly library people that their XML standards (I was talking about MODS and MADS in particular) are needlessly complex, inflexible, and awkward; that they were not hacker-friendly… [...]

  11. [...] I recently posted here regarding standards and libraries, specifically the need for lightweight APIs/formats for use in various projects. I also mentioned an article over at darcus blog regarding light vs complex, and there is even a bet that lightweight will win over heavyweight. While that can be debated, there is definitely a place for lightweight implementations. [...]


Creative Commons License