Seize the Paginae, Part C: HTTrack

Nov 1

Posted in: Technology

And now for the third post in my trifecta of tools for information viewing and copying – the free HTTrack Website Copier allows you to download the entire Internet to your local drive.

No, not really – but their web site does a nice job of explaining what it does:

It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site’s relative link-structure.

Simply open a page of the “mirrored” website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system.

Last year I was working my way through a fairly large set of local history interviews and photos from the town near where my grandmother was born. These interviews contained interesting stories and, I believe, more than a few clues to the relationships of various families in the area.

Then, one day I noticed a new post on the home page of the website: due to lack of local interest, it was to be taken down within several weeks. To avoid losing access to this resource, I broke out my trusty HTTrack, pointed it at the web site in question, set a local directory on my PC to save it in and let it rip.
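For the curious, HTTrack also ships with a command-line version, and the "point it at a site, pick a directory, let it rip" routine can be scripted. Here is a minimal sketch in Python, not the exact steps I took. The URL and directory are placeholders, the helper names are my own invention, and it assumes the `httrack` command is installed and on your PATH. The only HTTrack options used are `-O` (output directory) and `--update` (refresh an existing mirror).

```python
import shutil
import subprocess
from pathlib import Path


def build_httrack_command(url, mirror_dir, update=False):
    """Return the argument list for a basic HTTrack run (hypothetical helper)."""
    # -O sets the directory where the mirror, logs and cache are written.
    cmd = ["httrack", url, "-O", str(mirror_dir)]
    if update:
        # --update refreshes an existing mirror instead of starting over.
        cmd.append("--update")
    return cmd


def mirror_site(url, mirror_dir, update=False):
    """Run HTTrack against url, saving the copy under mirror_dir."""
    if shutil.which("httrack") is None:
        raise RuntimeError("httrack was not found on PATH")
    Path(mirror_dir).mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_httrack_command(url, mirror_dir, update), check=True)


# Example call (placeholder URL and folder):
# mirror_site("http://example.com/", Path.home() / "mirrors" / "example")
```

Running the same call later with `update=True` should refresh the existing copy, which is handy for a site that changes over time.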

When finished, I could load the local copies of the interview pages and surf the site on my hard drive as if it were the original one. That includes the text and all images – no fuss, no muss!

It is also a nice tool for archiving the contents of a changing site over time – a surname project I follow has lost some original (and useful) information over the years, but I can still reference the previous site via my local archives.

For reference, the Wayback Machine at the Internet Archive can also provide snapshots of how a site looked in the past – but it can be hit and miss, as many smaller sites do not get archived regularly or even at all.

That was the case in this particular instance – the Wayback Machine had last archived the interviews site back in 2009, and only the home page (without formatting!) was available.
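If you want to check quickly whether the Wayback Machine has anything for a site before you rely on it, the Internet Archive offers a small availability lookup at `https://archive.org/wayback/available`. The following is a rough sketch of how that lookup could be scripted. The function names are mine, `example.com` is a placeholder, and it only reports the single closest snapshot, so it says nothing about how complete that snapshot is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"


def availability_url(site, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": site}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return (timestamp, url) of the closest snapshot, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(site, timestamp=None):
    """Ask the Wayback Machine for the snapshot closest to timestamp."""
    with urlopen(availability_url(site, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


# Example call (placeholder site):
# print(lookup("example.com", "2009"))
```

A result of `None` means no snapshot was reported, which is a good hint that it is time to make your own copy.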

Back to HTTrack – while the interface might look a little old school and rough around the edges, using it with the default settings is fairly straightforward. There are quite a few technical settings that can be tweaked as well, so to avoid getting too technical here, I’ll just point to the documentation page. It is available for Windows, OS X and various Linux flavors.

You might not need this tool every day, or even very often – but when you do, it sure beats surfing and copying website contents by hand.

One final important thing to mention – this blog post is about the usefulness of an automation tool and purposely does not address copyright issues. Remember to respect copyright and to request permission where needed.
