XML DBs as Research Tools

Borrowing an idea from Jon Udell, and with a lot of help from Wolfgang Meier, author of the eXist XML DB, I worked out a quick-and-dirty way to use XML and the XQuery-based eXist for content analysis.

The story: I had to analyze a bunch of news documents, roughly 150 of them. So I downloaded them, used a simple shell script to run Tidy on all of them and convert them to clean XHTML, then ran an XSLT on those to clean them up further. From that base, I went through and tagged important content with keywords, using the class attribute on the span, p and q elements. I then used an XQuery script (pretty much written by Wolfgang) to pull out all paragraphs that contain the highlighted chunks of content and organize them by document title. Finally, I used CSS to render the content (though this is still very much a moving target; how should a passage with more than one keyword be represented, for example?).
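To make the extraction step concrete, here is a minimal sketch of the same idea in Python rather than XQuery. It is not Wolfgang's script: the function name, the sample document, and the keyword names ("blame", "policy", "context") are all made up for illustration, and it assumes the XHTML is well-formed and that a keyword is any token in the class attribute of a p, span or q element.

```python
import xml.etree.ElementTree as ET

# Namespace that Tidy's XHTML output declares on the root element.
XHTML = "{http://www.w3.org/1999/xhtml}"
MARKED = {XHTML + "p", XHTML + "span", XHTML + "q"}

# Hypothetical sample document; the class values are invented keywords.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample Story</title></head>
<body>
<p>Nothing marked here.</p>
<p>The mayor said <q class="blame">it was their fault</q> and
<span class="policy blame">promised reform</span>.</p>
<p class="context">Background paragraph.</p>
</body></html>"""


def highlighted_paragraphs(xhtml_text):
    """Return (title, hits) where hits is a list of (keywords, paragraph text)."""
    root = ET.fromstring(xhtml_text)
    title = root.findtext(f".//{XHTML}title", default="(untitled)")
    hits = []
    for p in root.iter(XHTML + "p"):
        keywords = set()
        # p.iter() yields the paragraph itself, then all of its descendants.
        for el in p.iter():
            if el.tag in MARKED and el.get("class"):
                keywords.update(el.get("class").split())
        if keywords:
            # Collapse whitespace so line breaks in the source do not matter.
            text = " ".join("".join(p.itertext()).split())
            hits.append((sorted(keywords), text))
    return title, hits


if __name__ == "__main__":
    title, hits = highlighted_paragraphs(SAMPLE)
    print(title)
    for keywords, text in hits:
        print("  [" + ", ".join(keywords) + "]", text)
```

Returning the keywords as a list per paragraph leaves the "more than one keyword" question to whatever does the rendering, which seems like the right place for it.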

I’ve never used a content analysis application, but this suggests to me that an XML DB like eXist can not only serve as an excellent bibliographic application, but also tie together data (research content) and metadata (bib records). As an example, I added meta tags to the XHTML headers and wrote a stylesheet to generate MODS records from them automatically. While adding the metadata to the XHTML is time-consuming, it is less so than manually creating each MODS file, and it does double duty by making the records easier to find later.
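The meta-to-MODS step can be sketched the same way. What follows is Python, not the XSLT stylesheet I actually used, and the mapping is an assumption made for illustration: it supposes meta tags named "author" and "date" (hypothetical names) and fills in only three MODS elements, titleInfo/title, name/namePart and originInfo/dateIssued.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
MODS_NS = "http://www.loc.gov/mods/v3"
MODS = "{" + MODS_NS + "}"

# Hypothetical header; the meta names "author" and "date" are assumptions.
SAMPLE_DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Sample Story</title>
<meta name="author" content="Doe, Jane"/>
<meta name="date" content="2004-03-01"/>
</head>
<body><p>Text.</p></body></html>"""


def meta_to_mods(xhtml_text):
    """Build a small MODS record (as an Element) from an XHTML header."""
    root = ET.fromstring(xhtml_text)
    head = root.find(XHTML + "head")
    # Collect name/content pairs from the meta tags in the header.
    meta = {m.get("name"): m.get("content")
            for m in head.iter(XHTML + "meta") if m.get("name")}

    mods = ET.Element(MODS + "mods")
    title_info = ET.SubElement(mods, MODS + "titleInfo")
    ET.SubElement(title_info, MODS + "title").text = head.findtext(
        XHTML + "title", default="")
    if "author" in meta:
        name = ET.SubElement(mods, MODS + "name")
        ET.SubElement(name, MODS + "namePart").text = meta["author"]
    if "date" in meta:
        origin = ET.SubElement(mods, MODS + "originInfo")
        ET.SubElement(origin, MODS + "dateIssued").text = meta["date"]
    return mods


if __name__ == "__main__":
    # Use a readable prefix instead of the default ns0 when serializing.
    ET.register_namespace("mods", MODS_NS)
    print(ET.tostring(meta_to_mods(SAMPLE_DOC), encoding="unicode"))
```

A real record would want more than this (a name type, a genre, a link back to the source file), but the point is that everything in it comes straight out of the header.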

update: what I really want is a button in my web browser that lets me highlight an excerpt in the XHTML file, click, and have the DocBook citation code copied to the clipboard. Maybe with JavaScript?

update 2: there seems to be a bug in MT (or maybe a virus) that replaces a post's content with the content of deleted comment spam. I had to go to the Google cache to recreate this entry. In other news, I got some help from Alf Eaton on the bookmarklet issue above.


