Posts Tagged ‘Word 2007’

Feedback to TC45

Posted in Uncategorized on September 5th, 2006 by darcusb – Comments Off

Following is the feedback I just sent to the ECMA TC45, which is overseeing Open XML.

I’ve posted my analysis of the bibliographic support outlined in the latest draft here.

In rough order of priority, I think you need:

1) to change the personal name model from first/middle/last to a more international-friendly given/family/prefix/suffix/other/sort-string.

For a specification that aspires to be an international standard, it’s simply unacceptable to be using a narrow, culturally-specific, personal name model.

The solution is simple: borrow from vCard, which is well-designed and widely-implemented, with the following name properties: given/family, honorific prefix and suffix, other, and a sort-string property (to account for different sorting conventions).

2) Rationalize your type list for bibliographic sources (see my post), and allow them to be extended

3) provide rules for property extension so that developers aren’t forced into an all-or-nothing choice of what is now a very limited model

4) ideally, you need to bring the bibliographic metadata representation in line with the metadata descriptions used for the OXML document per se, and with wider standards.

For example, you can use dc and dcterms for a lot of these properties, and in particular for critical relations (notably dcterms:isPartOf and dcterms:isVersionOf) that would make the model both more flexible and more robust. Bibliographic metadata is really not flat, and the current model imposes significant limitations that will have an impact on users and developers.

None of these changes is in any way onerous to implement in either the spec or in software, but they will significantly enhance the usefulness of the bibliographic support in OXML.

I would hope too that thinking in terms of more general metadata support can also improve other areas of the spec with respect to metadata (if, for example, you have similar problems with name models elsewhere).

Open XML, Draft 1.4

Posted in Uncategorized on September 5th, 2006 by darcusb – Comments Off

MS recently released a new draft of their Open XML format. This version includes more information about the citation and bibliographic support. Some notes …

First, citations. Really nothing to say except, yeah, they’ve documented what they’re doing a bit more (see the pdf, p1513), but no, they’ve not bothered to fix any of it. This isn’t really surprising I suppose, since they’re using generic fields to encode citations. But it still results in rather unfriendly XML.

Second, the bibliographic source format.

All the problems I noted earlier remain:

  1. the personal name model is Western—even U.S.—centric
  2. the model is (almost) totally flat and inflexible

Here is the list of types, with my own parenthetical comments:

  1. Art
  2. ArticleInAPeriodical (should be just “Article”)
  3. Book
  4. BookSection
  5. Case
  6. ConferenceProceedings
  7. DocumentFromInternetSite (should be just “Document”)
  8. ElectronicSource (which is?)
  9. Film
  10. InternetSite
  11. Interview
  12. JournalArticle
  13. MagOrNewsArticle (how is this any different from ArticleInAPeriodical??)
  14. Misc (hints of a broken model)
  15. Patent
  16. Performance
  17. Report
  18. SoundRecording

Again, limited and inconsistent, and it seems fixed.

Beyond that, the fields are pretty much flat, so the only thing to do is list them:

  1. Author
  2. BookTitle
  3. Broadcaster
  4. BroadcastTitle
  5. CaseNumber
  6. ChapterNumber
  7. City
  8. Comments
  9. ConferenceName
  10. Country
  11. CountryRegion
  12. Court
  13. Day
  14. DayAccessed
  15. Department
  16. Distributor
  17. Edition
  18. Guid
  19. Institution
  20. InternetSiteTitle
  21. Issue
  22. JournalName
  23. LCID
  24. Medium
  25. Month
  26. MonthAccessed
  27. NumberVolumes
  28. Pages
  29. PatentNumber
  30. PeriodicalTitle
  31. PlacePublished
  32. ProductionCompany
  33. PublicationTitle
  34. Publisher

As before, because of the flat model, we have six different title properties for the same thing: a related title. And the fields are fixed, and uncontrolled in the schema (the properties are just a blunt zero-or-more choice list).

In other words, on the one hand we have a relatively limited data model that does not reflect the kind of complexity and variability of real world citation data. On the other hand, it’s reflected in an incredibly loose schema that cannot be extended. The first problem is compounded by the second.

Finally, the one place where there is some more structure is contributors.

Awkwardness 1: the main element is Author, but in fact is a far broader Contributor, since the children of Author include:

  1. Artist
  2. Author
  3. BookAuthor (?? must be another awkward consequence of the flat model)
  4. Compiler
  5. Composer
  6. Conductor
  7. Counsel
  8. Director
  9. Editor
  10. Interviewee
  11. Interviewer
  12. Inventor
  13. Performer
  14. ProducerName
  15. Translator
  16. Writer

This is actually one of the few things I like about the schema, aside from the above-mentioned weirdness of paths like b:Author/b:Author.

I find the contributor name model particularly surprising in a format that aspires to be an international standard. The first/middle/last name tradition is quite culturally-specific, and I can only guess what Asian users will think about this, or Western users who need to deal with Asian sources. What’s even more frustrating is, it’s easy for them to fix.
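To show how small the fix is, here is a rough Python sketch of a vCard-style name model. The class and field names are my own illustration (loosely following the vCard name parts plus a sort string), not anything defined in the Open XML draft, and the sample names are made up.

```python
from dataclasses import dataclass


@dataclass
class PersonalName:
    """Name parts loosely modeled on vCard (illustrative, not from the OXML draft)."""
    family: str = ""
    given: str = ""
    other: str = ""        # additional names (what first/middle/last calls "middle")
    prefix: str = ""       # honorific prefix, e.g. "Dr."
    suffix: str = ""       # honorific suffix, e.g. "Jr."
    sort_string: str = ""  # explicit sort key; used instead of the default when set

    def sort_key(self) -> str:
        # Fall back to "family given" when no explicit sort string is supplied.
        return (self.sort_string or f"{self.family} {self.given}").strip().casefold()


names = [
    PersonalName(family="Smith", given="Ann"),
    PersonalName(family="van Gogh", given="Vincent", sort_string="Gogh Vincent"),
    PersonalName(family="Mao", given="Zedong"),
]

# Sorting never has to guess which part is "first" or "last".
ordered = [n.family for n in sorted(names, key=PersonalName.sort_key)]
print(ordered)
```

Nothing here assumes that the given name is written first or that a "middle" name exists; display order becomes a formatting decision rather than a property of the data.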

Hopefully we’ll see some improvements in the next draft.

Politics, MS and ODF

Posted in Uncategorized on July 19th, 2006 by darcusb – 1 Comment

Two comments from people at Microsoft on the suggestion (from me and others) that they join the OpenDocument Technical Committee to help ease interoperability gaps in the two formats going forward; first Brian Jones:

I think there are still plenty of ways we can help out the OASIS folks with the ODF format. The entire translator project is open source, so the conversion will be completely transparent and everyone will have the ability to benefit from what we discover as the transformations are built. In addition to that, as I’ve looked through our Ecma documentation, I’ve also been looking at the ODF spec as a point of comparison. As I come across areas that are either missing, or just not fully specified, I’ll be sure to point them out on my blog. That should help them in creating a list of areas to improve.

On one hand, this sounds quite generous. To this I say, sure Brian, that’d be great.

But if you parse the language (and my career is doing just that) it reflects the arrogance of a company that has for too long gotten by on the weight of its own monopoly position. Note: he does not acknowledge that MS might learn something from the experience (see below), and that OXML might be better for it. Likewise, he doesn’t acknowledge that OXML has already borrowed from ODF; for example, in its zipped package file structure.

Now, here’s Dare commenting on Brian’s post:

Unfortunately, the ODF discussion has seemed to be more political than technical which often obscures the truth. Microsoft is making moves to ensure that Microsoft Office not only provides the best features for its customers but ensures that they can exchange documents in a variety of document formats from those owned by Microsoft to PDF and ODF.

Make no mistake: there is something “political” in this position that MS is staking out, which seems to be:

  1. see, we are just as open as ODF?
  2. but ODF is a weak spec that pales in comparison to the technical excellence of Open XML
  3. MS is giving the people what they really want, which is file format support; witness the new BSD licensed ODF plug-in for Office

IBM’s Rob Weir is starting to pay some careful technical attention to these sorts of details. In his latest, he argues the heavy weight of OXML is going to introduce serious implementation, and thus interoperability, problems.

He addresses this through the 50+ pages of references to an obscure feature of page art borders. Yes, the spec actually includes these details! And as Rob points out, this sort of functionality is quite culturally-specific.

The images are heavily weighted to Western even Anglo-American celebratory icons, things like gingerbreadmen for Christmas or slices of Birthday cake, pumpkins for Halloween, or images of Cupid for St. Valentines day, or globes which are neatly centered on the United States.

Rob argues this is a perfect example of over-the-top spec bloat that will make implementation awkward for anyone but MS. Moreover, Rob actually provides an elegant alternative suggestion.

All of these problems (spec bloat, cultural bias, non-extensibility, copyright concerns) can be solved by one simple mechanism. Instead of having ST_Border be a fixed enumerated set of values, have it include only a small number of trivial values like the basic line styles, and have everything else (all of the Art Borders) be stored as a separate image file in the document archive.

Excellent!

Brian, you listening?

Elsewhere, Rob does a good job analyzing just how well MS is doing by their users in the ODF plug-in GUI and import quality.

Meanwhile, I have extensively pointed out where MS has fallen down in their new citation support. They have invented their own source format, have ignored library communications standards, and appear to be using critical citation coding that will be impossible for standard XPath-based XML tools to process. Some of this has implications for the file format, and I’ve yet to see any serious concern about the issues out of Redmond.

To be fair, them inventing their own source format is no big deal, since there aren’t any good standards here. Still, my other critiques apply.

Despite what it might seem, my position on these matters isn’t blindly political. I believe in open standards because I think in the end they yield better results for end users. I expect to prove that with the citation use case, but I really do want to raise the bar for academic end users all around. Enhancing interoperability between ODF and OXML is an important part of that, and both groups can learn from each other.

Atom and Citation Styles

Posted in Uncategorized on July 15th, 2006 by darcusb – 1 Comment

Jennifer Michelstein, the Microsoft program manager for academic features, has posted the first of a series of blog entries on the new citation support in Word 2007. In comments, we have been going back-and-forth on a few issues of concern.

Out of that conversation, I conclude:

  1. version 1 will not support the footnote/endnote style citations common in the humanities
  2. it seems (though I’ve not confirmed it) they don’t support first/subsequent citations in author-year styles
  3. they will provide an SDK to connect remote databases, though there is no evidence that they are even aware that there are well-deployed existing standards (Z39.50, and the more modern SRU and SRW equivalents) in this space
  4. they think it quite fine to leave it to third-parties to provide different styles for users, in XSLT

As I said in the comments, I think the last point particularly short-sighted. It will be really hard for anybody but XSLT experts to write good styles, and they will be specific to Word. Moreover, the style files will be huge, and difficult for users to install (impossible in some cases, if they don’t have appropriate installation rights).

Nevertheless, it does mean there’s plenty of room for someone to swap out the existing raw XSLT approach and replace it with my (I think much better) citeproc alternative. And I’ve been talking to M. David Peterson about just that.

Mark had an idea that I think may well be brilliant: use Atom to do much of the metadata and distribution work. It turns out that the current metadata element in CSL is almost exactly (in fact, by design) the same as the Atom metadata content. Moreover, the rest of a CSL file could be easily embedded in the atom:content element, and then individual entries linked together into feeds.

So what if, then, users never had to worry about installing style files? They would just subscribe to one or more feeds in their areas. If a new style they wanted appeared, they’d click a link and it would be automatically installed. If an updated version of an already-installed style showed up, the local version would be automatically updated. Plus, it ought to be possible to embed an XHTML preview of the style somewhere (perhaps in the atom:summary element?).
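As a rough sketch of how a client might decide what to install or update, assume each style is an Atom entry carrying an atom:id and an atom:updated timestamp, with the style itself embedded in atom:content. The feed below is made up, the style namespace is a placeholder, and styles_to_install is a hypothetical helper rather than part of any existing tool.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: the URLs and the style namespace are placeholders.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Citation styles</title>
  <entry>
    <id>http://example.org/styles/apa</id>
    <title>APA</title>
    <updated>2006-07-14T12:00:00Z</updated>
    <content type="application/xml">
      <style xmlns="http://example.org/csl-placeholder"/>
    </content>
  </entry>
  <entry>
    <id>http://example.org/styles/chicago</id>
    <title>Chicago</title>
    <updated>2006-07-15T09:30:00Z</updated>
    <content type="application/xml">
      <style xmlns="http://example.org/csl-placeholder"/>
    </content>
  </entry>
</feed>"""


def styles_to_install(feed_xml, installed):
    """Return ids of entries that are new, or newer than the local copy.

    `installed` maps a style id to the atom:updated value of the local copy.
    """
    pending = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        style_id = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        # Timestamps in the same UTC "Z" form compare correctly as strings.
        if style_id not in installed or installed[style_id] < updated:
            pending.append(style_id)
    return pending


local = {"http://example.org/styles/apa": "2006-07-01T00:00:00Z"}
print(styles_to_install(FEED, local))
```

Here the locally installed APA style is older than the feed entry and Chicago is not installed at all, so both ids come back as pending.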

Finally, because CSL is designed to be document-format agnostic, the same files could be used by users of any authoring solution: Word, OpenOffice, Writely, LaTeX, DocBook, web applications, etc.

There’d still be details to work out (I’d really like to allow distributed repositories), of course, but doesn’t this seem much better than the current MS approach?

PyULike vs. SmartFox: Centralized vs. Distributed

Posted in Uncategorized on July 9th, 2006 by darcusb – 2 Comments

CiteULike’s Richard Cameron has posted an interesting outline of a plan to rewrite the code in Python, called for now PyULike. Meanwhile, last week I heard from one of the developers of a really interesting new in-development Firefox plug-in called Firefox Scholar (we are talking about integrating my CSL language for citation processing, as well as import/export formats). Each attempts to solve very real problems for scholars, researchers, and students, but in quite different ways.

The problems are:

  1. How can you best integrate reference management seamlessly into modern web-focused research workflows? As a user, I spend a lot of time working with documents sourced from the web, so why should I then have to open a desktop application and manually enter reference data?
  2. How can one exploit the web and its network effects to allow users to benefit from the social aspects of reference management? It’s really hard to keep up with new work in my own field, let alone affiliated ones, so why can’t my reference management solution give me hints once in a while based on what others with like interests are reading?

Now, how do they solve these problems?

PyULike, like its predecessor, is based on a fully-centralized model. To quote Richard:

Previously I’ve resisted releasing or “open sourcing” code for the site for reasons which I outline on the site’s FAQ. Briefly, these are that I wish to prevent fragmentation of the userbase among a thousand private installations of the CiteULike software…. The benefits of keeping things centralised is that we keep the community effects. Users find others who are reading the same material, and they find papers serendipitously which they wouldn’t otherwise.

Firefox Scholar—aka SmartFox—is based on a slightly different, more distributed, model. Reference data will be stored locally, within Firefox 2.0’s embedded SQLite database. One will be able to extract references from pages one is browsing, or also manually enter and edit references within Firefox.

They then plan to add the ability to sync that data with a centralized server to provide similar sorts of social networking support. Moreover, it will be fully open-sourced, under a GPL license.

OK, but as a user and developer, I’m not so sure I want to be left with such discrete (all-or-nothing) choices. Why couldn’t I, for example, use SmartFox locally in my browser, but have it sync with PyULike’s server? Or more importantly, I don’t accept the notion that a centralized server and social networking are mutual requirements. Can we not realize the vision of these tools, but in a more distributed fashion? RDF and SPARQL, Atom?

Finally, I’ll reiterate the point I’ve repeatedly made: we need to get this stuff integrated within the desktop and publishing workflow. If I’m using PyULike or SmartFox (or both) I really need to be able to easily integrate my citations into Word or OpenOffice. MS is already adding the infrastructure to allow this in Word 2007, and we are trying hard to make the same happen at OpenOffice. Until that happens, only part of the puzzle is in place.

So how about some collaborative discussion among these projects so that we can have real interoperability, not only between these projects, but also between them and OpenOffice and Word? Maybe we could even settle on compatible licenses so that we can share code where appropriate.

The Chronicle Does Citation Software

Posted in Uncategorized on June 24th, 2006 by darcusb – Comments Off

The Chronicle has an article (which I stumbled on here) on citation management software. A couple of interesting excerpts:

And a few faculty members have tried the software for their own research and then gone back to tried-and-true manual methods. One is Lowell Turner, a professor of international and comparative labor and collective bargaining at Cornell. He says his graduate research assistants urged him to use RefWorks, but he found that the program couldn’t quickly or easily import a career’s worth of bibliographic material, in a variety of formats.

Another similar point, though this one hitting on the data model theme I’ve focused on extensively here:

He’s not alone. Even though legal scholarship follows exceedingly detailed citation rules that seemingly would be well suited to a computer program, legal scholars as a whole avoid citation software, says Kevin M. Clermont, a law professor at Cornell. Legal scholars often cite arcane documents from around the world, which citation software has difficulty handling, he said.

“It’s by light years not sophisticated enough to handle our problems,” he says.

Yup, I feel his pain, and unless MS fixes their data modelling approach, I’m afraid their new support won’t work for him either. Am hoping we can get it right at OpenOffice though.

Finally, on the costs:

Since November 2003, almost 11,000 people at the University of Minnesota-Twin Cities have registered for RefWorks, and they have stored a total of 570,000 references on RefWorks’ servers …

How much do they pay for this? $12,500 per year.

Sigh … so how much would it take to build a better open source solution using PostgreSQL and Ruby on Rails? If each institution that had such a site license put, say, $500 in a pot? No, that doesn’t include all the support issues involved in such an enterprise, but how hard can that really be?

Regrettably, the article focuses solely on RefWorks and Endnote. There’s no mention at all of the forthcoming support in Word, nor the OpenOffice work I’m involved in. In both cases, these efforts will offer superior integrated citation formatting support to word processors.

Likewise, there’s no mention of interesting developments in the world of free services and software like Connotea and CiteULike. Admittedly, neither of these is general enough to serve as a real substitute, but I think it’s only a matter of time before they are.

So nice to see the article, though it seems strangely dated.

Wither Apple?

Posted in General on June 16th, 2006 by darcusb – Comments Off

I’ve not written about Apple in a while. Mark Pilgrim’s announcement that he’s switching to Linux after 22 years on the Mac, and the absolutely absurd comments from the Mac zealots, just remind me that I’ll also likely be following Mark’s lead next time I buy hardware, and for most of the same reasons. I don’t have the interest to go into it in depth, but in short, their software is uninspiring, the company is more closed and standards-unfriendly than even Microsoft, and their hardware is expensive. I don’t buy music from iTunes, and I don’t ever plan to. The only software I care about that is not on Linux is the advanced photo-editing applications like Photoshop and Lightroom.

Of course, there’s still this little issue I’ve been obsessing about regarding citation support, but for now the stuff being added to Word 2007 is nowhere in sight on the Mac.

Flat vs. Relational

Posted in Uncategorized on June 16th, 2006 by darcusb – Comments Off

Now that I’ve covered most of the details of citations and bibliographies in Word 2007, let me return to the subject of the source format. The team that designed the schema made a number of design decisions. In designing the equivalent for use in OpenOffice and OpenDocument, I have made some different decisions. Similar debates have accompanied the effort to put together an hCite micro-format. Let’s compare.

So the structure of the bib schema in Office 2007 (and Brian Jones tells me, Open XML; this will be documented in the ECMA format) is a flat model, with a root of b:Sources, and primary child elements of b:Source. Typing is provided by a b:SourceType element. All properties of the bibliographic item are then described with child elements of b:Source; there is no hierarchy. So, for example, to encode titles:

  • for a Book, you use b:Title
  • for a BookSection, you use b:Title for the chapter title, but b:BookTitle for the container
  • for the journal article title, as above you use b:Title
  • for the journal title, you use b:JournalTitle
  • etc., etc.

The problem with this approach is you end up with an explosion of elements to describe the range of resources. I count 9 elements that are used to describe the same thing: titles (though currently they incorrectly assume a Case “Reporter” is a contributor; rather, it’s a periodical title). And they are missing a few: CollectionTitle and SeriesTitle are the obvious ones. Essentially, every new resource type (particularly if it has some part-container relation) needs a new title structure! And every time you add a new title structure, you have to update code elsewhere (in, for example, every single XSLT file that implements your citation styles!).
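To make the maintenance cost concrete, here is a small Python sketch of what formatting code has to do against a flat record. The dictionaries stand in for b:Source elements, CONTAINER_TITLE_FIELD and the "BookInSeries" type are hypothetical, and the titles are made up.

```python
# Which flat field holds the container title depends on the source type,
# so every consumer needs a lookup table like this one (hypothetical).
CONTAINER_TITLE_FIELD = {
    "BookSection": "BookTitle",
    "JournalArticle": "JournalTitle",
}


def container_title(source):
    """Return the title of the containing work, or None if the type is unknown."""
    field = CONTAINER_TITLE_FIELD.get(source["SourceType"])
    return source.get(field) if field else None


chapter = {"SourceType": "BookSection", "Title": "A chapter", "BookTitle": "A book"}
volume = {"SourceType": "BookInSeries", "Title": "A volume", "SeriesTitle": "A series"}

print(container_title(chapter))

# A new part-container type is invisible until every such table is edited.
before = container_title(volume)
CONTAINER_TITLE_FIELD["BookInSeries"] = "SeriesTitle"
after = container_title(volume)
print(before, after)
```

The same edit would have to be repeated wherever titles are read, which in the Word case means each XSLT style file.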

Also, the modeling is inconsistent, both internally, and with respect to the document-level metadata description in OXML. On the former, a simple example: the title of a book is b:Title, except when you are describing a section within the book, at which point it is a b:BookTitle. On the latter, OXML now uses DC to describe documents, but here we see no evidence of DC.

There’s another problem, incidentally, with the structure of the MS schema, which is more a limitation of the validation technology they are using (XML Schema) than anything. Because they use the same element for all types, they cannot validate the content by type. So it will be possible, for example, to include a b:BookTitle element within a journal article record. RELAX NG has no such limitations, but the schema isn’t expressed in RELAX NG.
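As an illustration of what validating by type would catch, here is a minimal Python sketch. The ALLOWED table is my own invention for the example (the real lists of fields per type would come from the spec), and unexpected_fields is a hypothetical helper.

```python
# Illustrative only: which fields each source type may carry.
ALLOWED = {
    "Book": {"Title", "Publisher"},
    "BookSection": {"Title", "BookTitle", "Publisher"},
    "JournalArticle": {"Title", "JournalTitle"},
}


def unexpected_fields(source):
    """Return the fields that do not belong on a record of this type."""
    allowed = ALLOWED[source["SourceType"]] | {"SourceType"}
    return sorted(set(source) - allowed)


article = {
    "SourceType": "JournalArticle",
    "Title": "An article",
    "JournalTitle": "A journal",
    "BookTitle": "Should not be here",
}

print(unexpected_fields(article))
```

A schema in which every type shares one element and one open list of children has no way to express the ALLOWED table, so a check like this has to live in application code instead.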

My approach, by contrast, is not flat, but relational. I use RDF for the relational modeling and linking. In the XML, I use typed nodes to encode the important information, which means one need only have two title structures: title and shortTitle. Conceptually, then, you end up with:

Article
   title
   isPartOf
      Journal
         title

And the majority of critical properties can be represented with standard DC and Extended DC; the same ones, incidentally, OXML already supports for the document!

Finally, an XML schema (expressed in RELAX NG) tightly controls the structure of the content by type.

More broadly, using a relational structure in which you keep the number of properties to a minimum has further benefits. The formatting system, for example, can be made much more robust.
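A quick Python sketch of why formatting gets simpler: with only title and isPartOf to worry about, one small function handles any depth of containment. The nested dictionaries are a stand-in for the RDF/XML, and the titles are invented for the example.

```python
def title_chain(item):
    """Collect titles from an item outward through its containers."""
    titles = []
    while item is not None:
        titles.append(item["title"])
        item = item.get("isPartOf")
    return titles


article = {
    "type": "Article",
    "title": "An article",
    "isPartOf": {"type": "Journal", "title": "A journal"},
}

# A chapter in a book in a series needs no new properties and no new code.
chapter = {
    "type": "Chapter",
    "title": "A chapter",
    "isPartOf": {
        "type": "Book",
        "title": "A book",
        "isPartOf": {"type": "Series", "title": "A series"},
    },
}

print(title_chain(article))
print(title_chain(chapter))
```

Adding a new kind of container changes the data, not the formatting code.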

(X)Forms in Bibliographic Apps

Posted in General on June 15th, 2006 by darcusb – Comments Off

Awhile back I wrote that a new bibliographic web application ought to include:

A configurable form system flexible enough to be configured for any resource type: everything from journal articles to books, to archival documents, to weblog posts. This presumes the form system should not be based on RIS or BibTeX, but rather around a more flexible standard like MODS. Either XML or YAML would be good bets for configuration languages in Ruby or Python.

I probably mentioned the idea of using a simple XML language to configure the GUI elsewhere too. In any case, MS has done just that in Word 2007:

So it seems the entire editing forms are configured with this XML file. In fact, I bet (though cannot now test) that one could add custom types by simply editing this file.
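To make the general idea concrete (this is not the Word file, whose format I am not reproducing here), a made-up XML form configuration and a few lines of Python to read it might look like the following. The element and attribute names, the "WeblogPost" type, and load_forms are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: one <form> per resource type.
FORM_CONFIG = """<forms>
  <form type="Book">
    <field name="Title" label="Title" required="true"/>
    <field name="Publisher" label="Publisher"/>
  </form>
  <form type="WeblogPost">
    <field name="Title" label="Post title" required="true"/>
    <field name="InternetSiteTitle" label="Weblog"/>
  </form>
</forms>"""


def load_forms(xml_text):
    """Map each resource type to a list of (name, label, required) tuples."""
    forms = {}
    for form in ET.fromstring(xml_text).findall("form"):
        forms[form.get("type")] = [
            (f.get("name"), f.get("label"), f.get("required") == "true")
            for f in form.findall("field")
        ]
    return forms


forms = load_forms(FORM_CONFIG)
for name, label, required in forms["WeblogPost"]:
    print(name, label, required)
```

With this shape, supporting a custom type is a matter of adding another form element; the GUI code that renders the fields does not change.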

Interestingly, the author definition includes an associated XSLT that converts a simple string to properly-structured XML, and another to convert the other way (though I still hate that it all, including the XML, presumes standard Western name forms; what if I am a scholar of Chinese history?). I wonder, can you do this in XForms?

I’ve been saying for a while that OOo needs to deepen XForms support to open it up to developers for these sorts of uses. This would be particularly interesting when coupled with the idea that a couple of the Sun engineers were discussing at the ODF metadata SC of creating a standard RDF XForms binding for our metadata work. That could allow GUIs to be essentially auto-configured for custom content.

Opening Up the Market

Posted in Uncategorized on June 13th, 2006 by darcusb – Comments Off

I said in my last post that:

I cannot emphasize enough how important it is that this stuff be standardized within document formats and included within editing applications. It’s critical, and the sad state of the current market is a direct consequence of the fact that it is not.

What I am saying may seem paradoxical: that including standard support commonly found in third-party plug-ins will actually open up the market, rather than close it. This is so only, however, if one can use alternate data sources. I should, put simply, be able to have Word access RefWorks, or Endnote, or whatever reference management software I want.

Thankfully, there’s a fairly easy way for Microsoft to allow this: tweak their Research Pane a bit.

Right now, the “insert citation” button on the Word ribbon includes an option to “search libraries.” When you click it, it brings the Research Pane up. Good!

Sadly, it doesn’t do anything useful (yet). What it should do is give default access to the Library of Congress SRU/W gateway, and to MS’s Academic search service. Further, it should be trivial to add any new data source to this.

Also, a user ought to be able to drag-and-drop the search results onto the document to cite them. I think this does suggest some enhancements to the Research Pane, including removing the requirement to use SOAP. RESTful web services are winning the day, and MS ought to support them.
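To illustrate how little a RESTful search needs, here is a sketch in Python of building an SRU searchRetrieve request as a plain URL. The operation, version, query and maximumRecords parameters are standard SRU; the base URL is a placeholder rather than a real gateway address, and sru_search_url is a hypothetical helper. No request is actually sent.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; substitute the address of a real SRU server.
url = sru_search_url("https://sru.example.org/catalog", 'dc.title = "open standards"')
print(url)
```

Any tool that can issue an HTTP GET and parse the XML response can act as a client, which is a much lower bar for third-party data sources than a SOAP toolkit.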

Problem solved … mostly. We now have good standard base support, but open up options for different kinds of users and user communities, as well as developers.

One problem with this approach, however, is that it puts a lot of burden on the source data format for interoperability, and right now, it is rather more limited than it should be to fulfill that requirement.

Incidentally, everything I’ve been saying is pretty much what we’ve been advocating at the OpenOffice bibliographic project. While it could be coincidence, I can’t help but wonder if people at MS haven’t been paying attention, and if we haven’t unintentionally done a bit of design work for them!

Update …

From MS’s Chris Pratley, on some forum, more info:

Word 2007 comes with a citation library capability, and by the time we ship it will have connections to on-line reference libraries so you can search for citations and download them to your local library. In beta 2 you have to manually enter citations, but you can keep them in your library and re-use them in different docs.

Word 2007 beta 2 has a set of the most common citation formats (MLA, APA, etc.), and this can be expanded either by end users (need to edit an XML file), or by third parties or Microsoft in the future. We expect a lot of people to add more formats you can download so you don’t have to make them yourself. We’re just two weeks into public beta so that hasn’t had a chance to happen yet.

So seems like good news, though his explanation on citation styling is cryptic.


Creative Commons License