Downloading News?
Anyone out there know of any simple scripts (preferably Python or Ruby, or maybe Bash, or even AppleScript) that I can use to download articles from a password-protected site like the New York Times?
I want to download the “printer” version of the files, run Tidy on them to convert to XHTML, and then run an XSLT stylesheet I’ve written to clean them up further (for, say, creating MODS records, or dumping in an XML DB and querying them).
I’m not exactly sure of my ideal workflow. Perhaps invoking a script from a menu within my newsreader? I have feeds for the New York Times; so that seems the simplest and quickest route. Still, a commandline interface is totally fine as well.
update: I found this on using wget to grab stuff from sites like the New York Times.
update 2: With a little help from Brent Simmons (author of my newsreader), I’ve got an AppleScript that takes the URL for the highlighted article and passes it to wget for download. OK, good. The problem is that the URL from the feed has a bunch of cruft added to the end, which keeps me from getting the specific (printer-friendly) page I want. Does anyone have any idea how to take the URL as a variable and remove everything after the “.html” in AppleScript?
Take a look at Fetchyahoo!. It is in Perl and it does all this stuff for Yahoo! Mail.
Thanks for that. I would have thought downloading mail would require really different code than doing the same for a webpage. Is that not so?
BTW, I’ve changed the site to require comment authorization (am tired of the spammers).
I’ve done similar things on a Windows box using wget from the unxutils package, Tidy, and Instant Saxon, all run from a DOS batch file. But you could do it all with Ant using a get, a jtidy, and an xslt task. The jtidy task is part of jtidy and has to be incorporated into Ant, but it’s not complicated. Ant is definitely worth getting to know for this kind of work: if everything else you’re doing is in XML, it makes sense to be able to generate batch jobs in XML.
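The same three-step pipeline (download, tidy, transform) can be sketched in Python by shelling out to command-line tools. This is only a rough sketch under some assumptions: it uses xsltproc in place of Instant Saxon, it assumes wget, tidy, and xsltproc are on the PATH, and the function names and intermediate file names (article.html and so on) are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_pipeline(url, stylesheet, workdir="."):
    """Return the download, tidy, and transform commands as argument lists."""
    work = Path(workdir)
    raw = work / "article.html"      # hypothetical intermediate file names
    xhtml = work / "article.xhtml"
    result = work / "article.xml"
    return [
        # 1. Download the page to a known file name.
        ["wget", "-q", "-O", str(raw), url],
        # 2. Convert to XHTML, with numeric entities so an XML parser can read it.
        ["tidy", "-quiet", "-asxhtml", "-numeric", "-output", str(xhtml), str(raw)],
        # 3. Apply the XSLT stylesheet; --novalid skips loading the XHTML DTD.
        ["xsltproc", "--novalid", "--output", str(result), str(stylesheet), str(xhtml)],
    ]


def run_pipeline(url, stylesheet, workdir="."):
    """Run each step in order and stop at the first real failure."""
    for command in build_pipeline(url, stylesheet, workdir):
        completed = subprocess.run(command)
        # tidy exits with status 1 when it only issued warnings.
        allowed = (0, 1) if command[0] == "tidy" else (0,)
        if completed.returncode not in allowed:
            raise RuntimeError("step failed: " + " ".join(command))
```

Something like run_pipeline("http://www.example.com/story.html", "clean.xsl") would then leave the transformed file in the working directory. Keeping the command lists separate from the code that runs them makes it easy to print them first and check them by hand.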
And what about handling usernames and passwords in wget (or curl, or Ruby’s open-uri)?
WWW::Mechanize is your friend.
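For anyone who would rather stay in Python, here is a minimal sketch of the same idea using only the standard library: submit the login form once, keep the cookies, then fetch the article with the same opener. It assumes the site uses a plain form-plus-cookie login; the form field names ("username" and "password") and all function names are hypothetical, so check the real login form before trying it.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Build an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_request(login_url, username, password):
    """Build the POST request that submits the login form."""
    # Hypothetical field names; read the real ones from the site's login form.
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("ascii"))


def fetch_after_login(login_url, article_url, username, password):
    """Log in, then fetch the article with the session cookies attached."""
    opener = make_opener()
    opener.open(build_login_request(login_url, username, password)).close()
    with opener.open(article_url) as response:
        return response.read()
```

The bytes returned by fetch_after_login can be written to a file and handed to Tidy as usual.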
In AppleScript, you can split a string on a particular substring by setting that substring as the text item delimiter:
set originalURL to "http://netapps.muohio.edu/blogs/darcusb/darcusb/archives/2004/09/14/downloading-news"
-- save original text item delims:
set oldDelims to AppleScript's text item delimiters
-- set text item delims to fit your needs:
set AppleScript's text item delimiters to {"downloading"}
-- extract the sub-string you’re interested in:
set strippedURL to item 1 of (every text item of originalURL)
-- restore original text item delims:
set AppleScript's text item delimiters to oldDelims

The variable strippedURL will contain “http://netapps.muohio.edu/blogs/darcusb/darcusb/archives/2004/09/14/” now. In your case, you may want to use “.html” (instead of “downloading”) as the text item delimiter. Note that the delimiter itself is not part of the result, so you would need to append “.html” to strippedURL afterwards.
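If the URL ends up being handled in Python rather than AppleScript, the same trimming can be sketched with str.partition, which also puts the “.html” back. The function name and the example URL below are made up for illustration.

```python
def trim_after_html(url, marker=".html"):
    """Return url cut off right after the first occurrence of marker."""
    head, found, _tail = url.partition(marker)
    # found is the marker itself, or "" when the marker is absent,
    # in which case head is the whole URL and it comes back unchanged.
    return head + found


print(trim_after_html("http://www.example.com/2004/09/14/story.html?ex=1&partner=rss"))
```

A URL without “.html” in it is returned as is, so the function is safe to call on every feed link.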
Try out http://www.scriptsearch.com or even … google.com or yahoo.com; they may help you a lot!