Download News Script
Posted in Uncategorized on September 28th, 2004 by darcusb

With help from various places (the new Ruby Forum, the author of the new Tidy package for Ruby, Matthias; thanks all!), I’ve now got the following script for my download issue.
So, I have an AppleScript I invoke from NetNewsWire that appends the selected headline's title, URL, and publication date to a YAML config file, building up a list of articles.

tell application "NetNewsWire"
    set article_url to (URL of selectedHeadline)
    set article_title to (title of selectedHeadline)
    set article_pubdate to ((date published of selectedHeadline) as string)
end tell

set delim to AppleScript's text item delimiters
-- Get the cleaned URL
set AppleScript's text item delimiters to "?"
set theUrl to text item 1 of article_url
set AppleScript's text item delimiters to delim

set newline to ASCII character 10
set theData to (newline & "-" & newline & " title: " & article_title & newline & " url: " & theUrl & newline & " pubdate: " & article_pubdate) & newline
set theFilePath to ((path to documents folder) & "News:downloads:downloadIndex.txt") as string

-- Append the new entry to the end of the index file
set fileRef to open for access theFilePath with write permission
set fileEOF to get eof fileRef
write theData to fileRef starting at (fileEOF + 1)
close access fileRef

The following Ruby script then reads that file, downloads the HTML file from each URL, runs Tidy on it to convert it to XHTML, and then runs an XSLT on that to create the final (very clean) output (including passing the URL as a parameter to saxon for insertion in the document header).
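
To make the index format concrete, here is a minimal sketch of what the file might look like after two headlines have been appended, and how Ruby's standard YAML library would read it. The titles, URLs, and dates are made up for illustration, and the exact date string NetNewsWire produces may well differ.

```ruby
require 'yaml'

# Hypothetical contents of downloadIndex.txt after two runs of the
# AppleScript. Every value here is invented for illustration.
sample_index = <<~INDEX

  -
   title: First Example Headline
   url: http://example.com/news/first.html
   pubdate: Tuesday, September 28, 2004 10:00:00 AM
  -
   title: Second Example Headline
   url: http://example.org/story/second.html
   pubdate: Tuesday, September 28, 2004 11:30:00 AM
INDEX

# YAML.load turns the sequence of mappings into an Array of Hashes.
docs = YAML.load(sample_index)

docs.each do |doc|
  title, url = doc.values_at('title', 'url')
  puts "#{title} -> #{url}"
end
```

One caveat with writing the values unquoted: a headline that contains a colon followed by a space would not parse as a plain YAML scalar, so quoting the title when writing the file may be worth considering.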
It could no doubt use much more work (error handling, support for different sites, etc.), but it’s a good start. It uses standard Ruby libraries, except for the Tidy package, which is new.

$TIDYLIB = '/usr/local/lib/libtidy.dylib'
require 'tidy'
require 'yaml'
require 'open-uri'

# config
datafile = 'downloadIndex.txt'
tidyconfigfile = 'tidyconfig.txt'

# Fetch pages
Tidy.open do |tidy|
  tidy.load_config(tidyconfigfile)
  YAML.load(File.read(datafile)).each do |doc|
    title, url = doc.values_at('title', 'url')
    # Build the local file names from the title with all whitespace removed
    name = title.gsub(/\s/, '')
    file = name + '.html'
    file_xhtml = name + '.xhtml'
    # Site-specific adjustments to the URL that actually gets fetched
    uri = url.dup
    uri = 'http://whatever.com/' + url if url =~ /bbc/
    uri << '?hp=&pagewanted=print' if url =~ /nytimes/
    File.open(file, 'w') do |article|
      puts "\nbegin processing \"#{title}\" ..."
      # On Ruby 3.0 and later, call URI.open here instead of open
      page = open(uri, 'referer' => 'http://www.google.com') { |io| io.read }
      article.write(tidy.clean(page))
    end
    # Shell out to saxon, passing the original URL as a stylesheet parameter
    `saxon -o #{file_xhtml} #{file} ../clean.xsl url="#{url}"`
    puts "... finished processing \"#{title}\""
  end
  puts "\ndone.\n\n"
end
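
As for the error handling and per-site support mentioned earlier, one possible direction is sketched below. It is only a sketch under assumptions: `SITE_RULES`, `fetch_uri_for`, `base_name`, and `process_all` are hypothetical names that do not appear in the script, and the fetcher is passed in as a callable so the example runs without network access or Tidy.

```ruby
# Hypothetical rule table: each entry pairs a pattern with a lambda that
# turns the article URL into the URL to fetch. The two rules mirror the
# script's bbc and nytimes special cases.
SITE_RULES = [
  [/bbc/,     ->(uri) { 'http://whatever.com/' + uri }],
  [/nytimes/, ->(uri) { uri + '?hp=&pagewanted=print' }]
].freeze

# Return the URL to fetch; URLs matching no rule pass through unchanged.
def fetch_uri_for(url)
  SITE_RULES.reduce(url) do |uri, (pattern, rewrite)|
    url =~ pattern ? rewrite.call(uri) : uri
  end
end

# Derive the local base file name the same way the script does:
# strip all whitespace from the title.
def base_name(title)
  title.gsub(/\s/, '')
end

# Process every entry, collecting failures instead of aborting the run.
# `fetcher` is any callable that takes a URL and returns the page body.
# The block receives the base file name and the fetched page.
def process_all(docs, fetcher)
  failures = []
  docs.each do |doc|
    title, url = doc.values_at('title', 'url')
    begin
      page = fetcher.call(fetch_uri_for(url))
      yield base_name(title), page
    rescue StandardError => e
      failures << [title, e.message]
    end
  end
  failures
end

# Usage with made-up entries and a fake fetcher that fails on one URL.
docs = [
  { 'title' => 'Some Headline', 'url' => 'http://www.nytimes.com/a.html' },
  { 'title' => 'Broken Link',   'url' => 'http://example.com/missing' }
]
fake_fetcher = lambda do |uri|
  raise IOError, '404 Not Found' if uri.include?('missing')
  "<html>#{uri}</html>"
end

saved = {}
failures = process_all(docs, fake_fetcher) { |name, page| saved[name] = page }
failures.each { |title, message| puts "skipped \"#{title}\": #{message}" }
```

In the real script the fetcher would presumably wrap the open-uri call, and the block would do the Tidy and saxon steps; keeping the rules in a table means a new site only needs one more entry.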
