Scraping Web Pages

So I’m trying to scrape watchtower.org’s New World Translation of the Holy Scriptures. They’ll send you a paper copy if you’d like, and they have a free version you can read online, but I’m afraid a downloadable digital edition is out of the question. It’s for a good cause: my friend would like to read the Watchtower’s version but doesn’t want to actually carry a Bible.

So far I’m using a script to increment web page addresses and pass them to wget. After wget pulls the HTML for each page, I pipe the output to html2text. Of course, html2text misses some of the emphasis tags along with subscripts, and it picks up some of the cruft at the tops and bottoms of the pages. I’m trying to pick up enough sed and awk to pass what’s left through a stream editor and cut out the cruft.
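For anyone who wants to see the shape of the thing, here’s a rough sketch of the same three steps (build the addresses, turn HTML into text, trim the cruft). It’s written in Python rather than wget, html2text and sed, so it isn’t the actual script. The URL pattern, the marker strings and all the function names are made up for illustration, and the real pages will almost certainly need different ones.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical URL pattern; the real page addresses are an assumption here.
URL_TEMPLATE = "https://example.org/bible/chapter_{:03d}.htm"


def page_urls(first, last, template=URL_TEMPLATE):
    """Yield one address per page number, first..last inclusive."""
    for number in range(first, last + 1):
        yield template.format(number)


def fetch(url):
    """Download one page (stands in for the wget step)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class TextExtractor(HTMLParser):
    """Collect page text, marking emphasis and subscript instead of dropping them."""

    MARKS = {"em": "*", "i": "*", "sub": "_"}
    SKIP = {"script", "style"}
    BREAKS = {"p", "br", "div"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.MARKS:
            self.parts.append(self.MARKS[tag])
        elif tag in self.BREAKS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.MARKS:
            self.parts.append(self.MARKS[tag])

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Convert HTML to plain text (stands in for the html2text step)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def strip_cruft(text, start_marker, end_marker):
    """Keep only the lines strictly between the two marker lines (the sed/awk step)."""
    kept, inside = [], False
    for line in text.splitlines():
        if not inside and start_marker in line:
            inside = True
        elif inside and end_marker in line:
            break
        elif inside:
            kept.append(line)
    return "\n".join(kept).strip()
```

Stringing it together would look something like `strip_cruft(html_to_text(fetch(url)), "BEGIN", "END")` for each `url` from `page_urls(1, 50)`, where "BEGIN" and "END" are placeholders for whatever text reliably brackets the chapter on the real pages.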

I’ll post the script once it’s finished.
