Archive for September, 2006

Discussing vCard/FOAF

Posted in Uncategorized on September 16th, 2006 by darcusb – Comments Off

I see new W3C Semantic Web lead Ivan Herman started a discussion over on the life sciences list of my suggestion to update vCard in RDF and to harmonize it and FOAF.

The discussion that ensued is interesting, but it strikes me that there’s a tendency to lose the forest for the trees. Might be that the question Ivan posed was too open-ended.

The issue with harmonizing the two is at a macro level really basic, and involves answering the following questions:

  1. Does structured personal name data get represented as properties, or as classes? FOAF does the former now; vCard the latter. The latter is a little more complicated, but better.
  2. Should structured personal name properties be consistent, and consistently international-friendly, or not? FOAF in its current incarnation says no; vCard yes.

There are other issues having to do with other properties (which properties, whether to separate out social networking stuff into a separate namespace, etc.), but those are detail questions that seem to me easy enough to resolve.

I think the first step is to resolve vCard-in-RDF. While it might not be possible to “dump” the existing note altogether as I suggested, the W3C can certainly add a new one, and deprecate the old. This is really not hard!

Ultimately, it ought to be possible (and sensible) to do:

<foaf:Person rdf:about="http://ex.net/info#me">
   <vcard:fn>Jane Doe</vcard:fn>
   <foaf:name>
      <vcard:given-name>Jane</vcard:given-name>
      <vcard:family-name>Doe</vcard:family-name>
   </foaf:name>
   <vcard:adr>
      ...
   </vcard:adr>
</foaf:Person>
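To make the properties-versus-classes question a bit more concrete, here is a rough Ruby sketch of the difference between a name held as flat string properties and a name held as a structured object of its own. The names used (StructuredName, sort_string, firstName, surname and so on) are illustrative assumptions loosely following vCard's name parts; they are not taken from either vocabulary.

```ruby
# Hypothetical sketch: names as flat properties vs. as a structured object.
# All class and field names here are illustrative assumptions.

# Property style: each name part is a plain string hung directly on the person.
flat_person = { "firstName" => "Jane", "surname" => "Doe" }

# Class style: the name is an object of its own, so it can carry
# consistent parts plus an explicit sort string.
StructuredName = Struct.new(:given, :family, :prefix, :suffix, :sort_string) do
  # Fall back to "Family, Given" when no explicit sort string was supplied.
  def sort_key
    sort_string || [family, given].compact.join(", ")
  end
end

Person = Struct.new(:formatted_name, :name)

jane = Person.new("Jane Doe", StructuredName.new("Jane", "Doe"))
mao  = Person.new("Mao Zedong", StructuredName.new("Zedong", "Mao", nil, nil, "Mao Zedong"))

puts jane.name.sort_key
puts mao.name.sort_key
```

The structured version costs one extra object, but the sort string gives non-Western names a place to say how they should be ordered.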

Photo Metadata

Posted in Uncategorized on September 14th, 2006 by darcusb – Comments Off

Norm Walsh discusses what looks to be a fiendishly cool new photo metadata editing application. Rails + OpenID + all the metadata flexibility and power of RDF.

In the screenshot, his drop-down selection lists are in fact links to resources. That is, all those properties are not strings, but links to full description objects. I imagine he may be doing something clever with that infrastructure (I dunno, maybe when the user hovers over such a property a tooltip pops up with additional information?).

Kind of fits how I was thinking about the OpenDocument use case of linking a document author property to a contact record. Imagine an ODF editor loads a contact file, and a user can choose an author from a list.

Also of note is his UI for configuring the metadata display and editing GUI. In some ways, this is exactly the sort of support I’d like to enable in ODF applications, though obviously with a more user-oriented UI.

Plugging Into FRBR, Killing MARC

Posted in Uncategorized on September 11th, 2006 by darcusb – 4 Comments

Realized my diagram was sort of wrong (missing ids), and so have changed it.

Karen Coyle has a post trying to spur discussion about how to hasten a world of library metadata beyond MARC.

My response in the comments was:

You won’t be surprised to know that I think [the way] out of MARC and its legacy (including MODS frankly) is RDF. So think about stripping down MODS to something closer to DC and DCQ (though it might be different in fact) and then through RDF/OWL schemas plugging those more grounded views into the more abstract world of FRBR (which now has a nice RDF representation).

Let me demonstrate what I mean.

I have been working on an RDF schema for citation metadata. The idea is to have a rich model and XML format (backed up by a RELAX NG schema) that is easy for pretty much any developer to pick up and use. The Zotero guys are starting to use this in fact.

This generally means relying on really grounded property terms (author, title, shortTitle) and embedding the meaning in the OWL schema.

OK but what does this mean? An example:

sbo:creator a owl:ObjectProperty ;
    rdfs:label "creator"@en ;
    rdfs:subPropertyOf frbr:creator ;
    rdfs:domain [ owl:unionOf (sbo:Reference sbo:Note) ] ;
    rdfs:range sbo:Agent ;
    rdfs:comment "An agent primarily responsible for the intellectual or artistic content of a work."@en .

What am I saying in this short space?

  1. sbo:creator is a reference to another object; an sbo:Agent
  2. its English-language label is “creator”
  3. it is a subProperty of frbr:creator

The last point is particularly critical, because it means I can plug my data into a FRBR view. Even just the natural language documentation value is evident here, but more importantly, RDF reasoning tools will be able to make use of this. They will know that a sbo:author is also a sbo:creator, which is also a frbr:creator, and therefore can infer additional information, such as that it refers to an agent responsible for creating a frbr:work.
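A rough sketch of what that inference amounts to, in plain Ruby rather than a real RDF toolkit. The sub-property chain is the one described above; the lookup table, the method names, and the ex: identifiers are hypothetical, and the sketch assumes each property has at most one parent.

```ruby
# Hypothetical sketch of rdfs:subPropertyOf inference using plain Ruby data.
# A real system would read these links from the schema with an RDF library.
SUB_PROPERTY_OF = {
  "sbo:author"  => "sbo:creator",
  "sbo:creator" => "frbr:creator"
}.freeze

# Walk up the subPropertyOf chain, collecting every super-property.
def super_properties(property)
  chain = []
  while (parent = SUB_PROPERTY_OF[property])
    chain << parent
    property = parent
  end
  chain
end

# Given asserted triples, add the statements a reasoner would infer.
def infer(triples)
  inferred = triples.flat_map do |subject, property, object|
    super_properties(property).map { |parent| [subject, parent, object] }
  end
  (triples + inferred).uniq
end

asserted = [["ex:book1", "sbo:author", "ex:jane"]]
infer(asserted).each { |triple| puts triple.join(" ") }
```

One asserted sbo:author statement comes back as three: the original, plus the sbo:creator and frbr:creator statements that follow from the schema.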

This is what I mean by taking simpler, more grounded representations and plugging them into the more abstract view of FRBR. Through the ontology and related tools, you can, if you want (you can ignore FRBR entirely), basically take more-or-less flat data and explode it into a larger view, mixing and merging data as you go. Representing this graphically (where dashed lines indicate inferred statements):

I’m still working out exactly how to add this logic to my schema, but I see a lot of power in this approach: really simple and grounded vocabularies, but tied into more abstract models.

What I see in the last few decades of library metadata work (with the notable exception of DC) is a tendency to overburden the formats themselves with the task of carrying meaning. This results in formats that are not only more complex than they need to be, but also much more difficult to use.

Zotero and the Practical Semantic Web

Posted in Uncategorized on September 9th, 2006 by darcusb – Comments Off

Was reading this first look at Zotero and nodding my head in agreement at these two suggestions:

  1. when adding tags, should have a lookup table so you can select one that you’ve already used (good for consistency).
  2. nice to have a way of browsing by tag (as in del.icio.us), probably over in the lefthand panel.

Having been talking to the Zotero guys about this very issue, I actually was thinking about taking this even further. I’d put the requirement more generally, then, to say that Zotero ought to make it easy for users to have consistent tags, and for them to easily access data by those tags.

At some point, the Zotero team wants to have a server, and to offer services tied into that. They’ve got a new domain name, which makes it a perfect platform on which to build semantic web functionality.

Tags as they have been used in the past few years are convenient, but also problematic. As above, a single user may inconsistently use different strings to represent the same concept. If you scale this to multiple users, potentially working in multiple languages, this becomes something of a mess of tags.

What if instead tags got normalized, such that the user gets auto-completed tags and each tag, when assigned, is tied to a URI internally?

Firefox 2 (on which Zotero is based) already has auto-complete searching, so there’s some infrastructure to build just this sort of smart support. From the perspective of a user, it can be as simple as entering plain strings, but it makes their data more consistent. More importantly, it makes their data more consistent with other users’ data, thus opening up some powerful possibilities of data merging and such.
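Here is a minimal sketch of that idea, assuming nothing about how Zotero actually stores tags. The TagRegistry class, its methods, and the example.org base URI are all made up for illustration: the user types plain strings, and the registry quietly maps them to one URI per concept.

```ruby
# Hypothetical sketch: tags stored as URIs, entered as plain strings.
# The class name, method names and base URI are illustrative assumptions.
class TagRegistry
  def initialize(base_uri)
    @base_uri = base_uri
    @uri_for = {}   # normalized label => URI
    @label_for = {} # URI => label as first entered
  end

  # Collapse case and whitespace so "Semantic Web" and "semantic  web" match.
  def normalize(label)
    label.strip.downcase.gsub(/\s+/, " ")
  end

  # Return the existing URI for a label, or mint one on first use.
  def assign(label)
    key = normalize(label)
    @uri_for[key] ||= begin
      uri = @base_uri + key.tr(" ", "-")
      @label_for[uri] = label.strip
      uri
    end
  end

  # Labels already in use that start with what the user has typed so far.
  def complete(prefix)
    key = normalize(prefix)
    @uri_for.select { |k, _| k.start_with?(key) }.values.map { |uri| @label_for[uri] }
  end
end

tags = TagRegistry.new("http://example.org/tags/")
first  = tags.assign("Semantic Web")
second = tags.assign("semantic  web")
puts first == second
puts tags.complete("sem").inspect
```

Two differently typed strings end up on the same URI, which is what would make merging data across users tractable. Handling multiple languages would need a label-per-language table on top of this.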

The same can, and should, be done for other key citation objects. Zotero is in fact a perfect opportunity to realize a more semantic web.

Updating vCard and FOAF

Posted in Uncategorized on September 9th, 2006 by darcusb – 1 Comment

Been chatting with some people about the following idea …

Problem

RDF has no consistent way to represent agents and contacts.

The vCard example hosted by the W3C is an absolute mess, reflecting all the syntactic ugliness that scares people away from RDF, as well as having some modelling weirdness.

FOAF is of course much better, and more widely used I suspect, but it needs a serious cleanup to rationalize in particular its personal name structures, which now have overlapping but essentially incompatible name properties and inconsistent property names.

Proposed Solution

So how about instead:

  1. the W3C dump the above page, and replace it with Norm Walsh’s much better and more recent vCard version, giving it a nice W3C namespace as well
  2. update FOAF to adopt the much better vCard name model and otherwise harmonize the two so they are complementary

In that case, vCard and FOAF could be more easily used together. If I need to identify a contact as a Person and include social networking metadata, I use FOAF. If I need addresses and such not covered by FOAF, mix in vCard. Right now this would be an awkward endeavor.

Atom and RDF

Posted in Uncategorized on September 8th, 2006 by darcusb – Comments Off

Stefano Mazzocchi has an interesting explanation of a new tool he released to read web server logs and automatically find out what other web pages have to say about your own, and why he chose to use RDF rather than Atom:

Atom is a step forward from general XML because it allows you to split the data model into many tree fragments, each with a unique identifier. But there are two issues with using Atom as a general data modelling language for many separated items: it lacks the ability to model relationships between such items…. So, while atom might have allow[ed] me to model the single items (pages, comments and feeds), I would have to extend it with my own markup to model the relationships between these items. Ending up reinventing the RDF wheel anyway and in a way that would be incompatible with RDF tools and ignored by Atom tools.

This is an issue I’ve been arguing with respect to OpenDocument. If we agree on a requirement that ODF metadata ought to be extensible (and it’s hard to argue it shouldn’t be!), and we want different metadata descriptions to be able to refer to each other (say, a document description to reference a contact description for an author, an image description to reference a licensing description, etc.), then we really have two choices: use RDF, or reinvent it.
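As a toy illustration of descriptions referring to each other, here is a sketch that stores everything as subject/property/object triples in a plain Ruby array. The ex: identifiers and the objects helper are hypothetical; the point is only that a link between descriptions is just another statement, which is the part a tree format has to bolt on.

```ruby
# Hypothetical sketch: separate metadata descriptions that point at each other
# by identifier, stored as subject/property/object triples. All ex: names are made up.
triples = [
  ["ex:doc1",       "dc:title",        "Quarterly Report"],
  ["ex:doc1",       "dc:creator",      "ex:jane"],        # link to a contact description
  ["ex:jane",       "vcard:fn",        "Jane Doe"],
  ["ex:image1",     "dcterms:license", "ex:license-by"],  # link to a licensing description
  ["ex:license-by", "dc:title",        "Attribution"]
]

# Look up the objects for a given subject and property.
def objects(triples, subject, property)
  triples.select { |s, prop, _| s == subject && prop == property }.map(&:last)
end

# Follow the document -> author link, then read the author's own description.
author = objects(triples, "ex:doc1", "dc:creator").first
puts objects(triples, author, "vcard:fn").first
```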

Person Class

Posted in General on September 7th, 2006 by darcusb – Comments Off

Actually, I changed my mind; this is an input problem, not a model one. Borrowing the basic structure of vCard, with its distinction between formatted name properties and complex structured (international-friendly) names, solves the latter problem.

A solution to my discussion of the name problem in citation metadata, in Ruby code:

class Person
  # The name model must be flexible in order to account for:
  #   Asian names, where sort and display order are the same
  #   single-name names (Prince, Madonna, etc.)
  # The rule, then, is that the name as it should be sorted gets
  # used to create the object.
  # An Asian name would be: Person.new("Mao Zedong")
  # A standard Western personal name would then be: Person.new("Doe, Jane, III")

  attr_reader :sort_name

  def initialize(sort_name = nil)
    @sort_name = sort_name
  end

  # Split a "Family, Given, Suffix" sort string into its parts.
  def name_parts
    @sort_name.split(", ")
  end

  def family_name
    if @sort_name =~ /,/
      name_parts[0]
    else
      # obviously this needs work
      @sort_name
    end
  end

  def given_name
    if @sort_name =~ /,/
      name_parts[1]
    else
      @sort_name
    end
  end

  def suffix
    if @sort_name =~ /,/
      name_parts[2]
    else
      @sort_name
    end
  end

  # Names without a comma are displayed exactly as they sort.
  def display_name
    if @sort_name =~ /,/
      given_name + " " + family_name
    else
      @sort_name
    end
  end
end

mao = Person.new("Mao Zedong")
jane = Person.new("Doe, Jane")
prince = Person.new("Prince")

list = [mao, jane, prince]

list.each do |person|
  puts "display name: " + person.display_name
  puts "sort name: " + person.sort_name
  puts "-"
end

Results:

$ ruby person.rb
display name: Mao Zedong
sort name: Mao Zedong
-
display name: Jane Doe
sort name: Doe, Jane
-
display name: Prince
sort name: Prince
-

Feedback to TC45

Posted in Uncategorized on September 5th, 2006 by darcusb – Comments Off

Following is the feedback I just sent to the ECMA TC45, which is overseeing Open XML.

I’ve posted my analysis of the bibliographic support outlined in the latest draft here.

In rough order of priority, I think you need:

1) to change the personal name model from first/middle/last to a more international-friendly given/family/prefix/suffix/other/sort-string.

For a specification that aspires to be an international standard, it’s simply unacceptable to be using a narrow, culturally-specific, personal name model.

The solution is simple: borrow from vCard, which is well-designed and widely-implemented, with the following name properties: given/family, honorific prefix and suffix, other, and a sort-string property (to account for different sorting conventions).

2) Rationalize your type list for bibliographic sources (see my post), and allow them to be extended

3) provide rules for property extension so that developers aren’t forced into an all-or-nothing choice of what is now a very limited model

4) ideally, you need to bring the bibliographic metadata representation in line with the metadata descriptions used for the OXML document per se, and with wider standards.

For example, you can use dc and dcterms for a lot of these properties, and in particular for critical relations (notably dcterms:isPartOf and dcterms:isVersionOf) that would make the model both more flexible and more robust. Bibliographic metadata is really not flat, and the current model imposes significant limitations that will have an impact on users and developers.

None of these changes is in any way onerous to implement in either the spec or in software, but they will significantly enhance the usefulness of the bibliographic support in OXML.

I would hope too that thinking in terms of more general metadata support can also improve other areas of the spec WRT metadata (if, for example, you have similar problems with name models elsewhere).

Names and Dates

Posted in Uncategorized on September 5th, 2006 by darcusb – 2 Comments

In playing with Zotero over the last few days I’m reminded that the two most difficult things to correctly handle in bibliographic databases and the GUIs built on top of them are names and dates.

Consider the following names:

  1. John van Doe, III
  2. Mao Zedong
  3. The Rolling Stones
  4. Prince
  5. Senate Committee on Trees, Plants, and Flowers

A database needs to store these names in ways that make it possible to reliably sort and (re)format them.

Name 1 is a more standard Western personal name, with two little twists: an articular (which may or may not be included in sorting, depending on locale), and a suffix. By convention, bibliographic software that is developed in North America or Western Europe assumes these sorts of names, and a very particular notion of the relation between display and sorting. So we have things like first name (secondary key) and last name (primary).

But name 2 points out one problem of this: not all languages have the same sorting conventions. For the name “Mao Zedong” (a transliterated Mandarin name) you sort on “Mao.” This is actually easier in many ways, since sort and display are equivalent, but not if you assume first/last names. Yes, “Mao” was his “first” name, but not at all the same kind of first name as mine.

Names 3 and 4 also throw a wrench in standard expectations; the first is a group (an organization is another sort of group), and the second a pseudonym. Name 5 shows that with group or organizational names, you have to ignore standard delimiters like commas.

So if you have fields like first and last name, you’re already severely limiting what kind of data can be stored. If you just have a single field for names or dates, then, you have to be really careful to make it clear to users how they should enter their data.

My preference on names would be a single field with a GUI hint on how to enter (as sort order, so “Doe, Jane” or “Mao Zedong”), a checkbox to indicate a group (to switch off parsing), and then a tooltip that showed the display name (how the software is parsing the name string). That seems to give the best balance of structure and flexibility.
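A rough sketch of how that single-field input could behave. The NameEntry class and its group: flag are hypothetical, and the parsing is deliberately naive (it does nothing clever about particles like "van"):

```ruby
# Hypothetical sketch of the single-field approach: the user types the name
# in sort order, and a "group" flag switches parsing off.
class NameEntry
  attr_reader :sort_name

  def initialize(sort_name, group: false)
    @sort_name = sort_name
    @group = group
  end

  # What the tooltip would show: how the software reads the string.
  def display_name
    # Groups and comma-less names are displayed exactly as entered.
    return @sort_name if @group || !@sort_name.include?(",")

    family, given, suffix = @sort_name.split(/,\s*/, 3)
    display = "#{given} #{family}"
    suffix ? "#{display}, #{suffix}" : display
  end
end

entries = [
  NameEntry.new("van Doe, John, III"),
  NameEntry.new("Mao Zedong"),
  NameEntry.new("Prince"),
  NameEntry.new("Senate Committee on Trees, Plants, and Flowers", group: true)
]

entries.each { |entry| puts entry.display_name }
```

Without the group flag, the Senate committee would be parsed as a person with family name "Senate Committee on Trees", which is exactly the failure the checkbox is there to prevent.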

Dates are also problematic in quite similar ways, because they just don’t fit the neat boxes of standard datatypes. Consider:

  1. November/December 2000
  2. Spring 2001
  3. Second Quarter, 2002
  4. c. 200 BC

Here I’d prefer four separate fields: year, month, day, other. This is how RIS handles dates, and it seems the best balance.
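A minimal sketch of the four-field idea, assuming a hypothetical PartialDate structure. The slash-delimited output is only in the spirit of the RIS layout (year/month/day/other), not a tested RIS writer, and treating BC years as plain negative numbers is a simplification.

```ruby
# Hypothetical sketch of a four-field date: year, month, day, plus a free-text
# "other" slot for whatever doesn't fit (seasons, quarters, "c.").
PartialDate = Struct.new(:year, :month, :day, :other) do
  # Sort on whatever numeric parts exist; missing parts count as zero.
  def sort_key
    [year || 0, month || 0, day || 0]
  end

  # Slash-delimited form; empty fields are simply left blank.
  def to_slashed
    [year, month && format("%02d", month), day && format("%02d", day), other].join("/")
  end
end

dates = [
  PartialDate.new(2002, nil, nil, "Second Quarter"),
  PartialDate.new(2000, 11, nil, "November/December"),
  PartialDate.new(2001, nil, nil, "Spring"),
  PartialDate.new(-200, nil, nil, "c. 200 BC") # BC as a negative year: a simplification
]

dates.sort_by(&:sort_key).each { |date| puts date.to_slashed }
```

The numeric fields give the software something reliable to sort on, while the "other" field keeps the text the user actually needs to see in the citation.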

Alternately, I could imagine a single field, though it might be a little tricky.

Open XML, Draft 1.4

Posted in Uncategorized on September 5th, 2006 by darcusb – Comments Off

MS recently released a new draft of their Open XML format. This version includes more information about the citation and bibliographic support. Some notes …

First, citations. Really nothing to say except, yeah, they’ve documented what they’re doing a bit more (see the pdf, p1513), but no, they’ve not bothered to fix any of it. This isn’t really surprising I suppose, since they’re using generic fields to encode citations. But it still results in rather unfriendly XML.

Second, the bibliographic source format.

All the problems I noted earlier remain:

  1. the personal name model is Western—even U.S.—centric
  2. the model is (almost) totally flat and inflexible

Here is the list of types, with my own parenthetical comments:

  1. Art
  2. ArticleInAPeriodical (should be just “Article”)
  3. Book
  4. BookSection
  5. Case
  6. ConferenceProceedings
  7. DocumentFromInternetSite (should be just “Document”)
  8. ElectronicSource (which is?)
  9. Film
  10. InternetSite
  11. Interview
  12. JournalArticle
  13. MagOrNewsArticle (how is this any different from ArticleInAPeriodical??)
  14. Misc (hints of a broken model)
  15. Patent
  16. Performance
  17. Report
  18. SoundRecording

Again, limited and inconsistent, and it seems fixed.

Beyond that, the fields are pretty much flat, so the only thing to do is list them:

  1. Author
  2. BookTitle
  3. Broadcaster
  4. BroadcastTitle
  5. CaseNumber
  6. ChapterNumber
  7. City
  8. Comments
  9. ConferenceName
  10. Country
  11. CountryRegion
  12. Court
  13. Day
  14. DayAccessed
  15. Department
  16. Distributor
  17. Edition
  18. Guid
  19. Institution
  20. InternetSiteTitle
  21. Issue
  22. JournalName
  23. LCID
  24. Medium
  25. Month
  26. MonthAccessed
  27. NumberVolumes
  28. Pages
  29. PatentNumber
  30. PeriodicalTitle
  31. PlacePublished
  32. ProductionCompany
  33. PublicationTitle
  34. Publisher

As before, because of the flat model, we have six different title properties for the same thing: a related title. And the fields are fixed, and uncontrolled in the schema (the properties are just a blunt zero-or-more choice list).

In other words, on the one hand we have a relatively limited data model that does not reflect the kind of complexity and variability of real world citation data. On the other hand, it’s reflected in an incredibly loose schema that cannot be extended. The first problem is compounded by the second.
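To show what a less flat model buys you, here is a small sketch using plain Ruby hashes. The records and the container_title helper are hypothetical, and the property names simply echo dc and dcterms: one generic "part of" relation replaces a separate title field per source type.

```ruby
# Hypothetical sketch: one generic "part of" relation instead of a separate
# title field per source type. All records here are made up.
journal = { "type" => "Journal", "dc:title" => "Journal of Examples" }
book    = { "type" => "Book",    "dc:title" => "Collected Examples" }

article = { "type" => "Article", "dc:title" => "On Examples",   "dcterms:isPartOf" => journal }
chapter = { "type" => "Chapter", "dc:title" => "More Examples", "dcterms:isPartOf" => book }

# The same lookup serves every source type; no JournalName vs. BookTitle split.
def container_title(source)
  container = source["dcterms:isPartOf"]
  container && container["dc:title"]
end

[article, chapter].each do |source|
  puts "#{source['dc:title']} in #{container_title(source)}"
end
```

Adding a new kind of container (a conference, a web site, a broadcast series) then needs no new title property at all.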

Finally, the one place where there is some more structure is contributors.

Awkwardness 1: the main element is Author, but it is in fact a far broader Contributor, since the children of Author include:

  1. Artist
  2. Author
  3. BookAuthor (?? must be another awkward consequence of the flat model)
  4. Compiler
  5. Composer
  6. Conductor
  7. Counsel
  8. Director
  9. Editor
  10. Interviewee
  11. Interviewer
  12. Inventor
  13. Performer
  14. ProducerName
  15. Translator
  16. Writer

This is actually one of the few things I like about the schema, aside from the above-mentioned weirdness of paths like b:Author/b:Author.

I find the contributor name model particularly surprising in a format that aspires to be an international standard. The first/middle/last name tradition is quite culturally-specific, and I can only guess what Asian users will think about this, or Western users who need to deal with Asian sources. What’s even more frustrating is, it’s easy for them to fix.

Hopefully we’ll see some improvements in the next draft.


Creative Commons License