Site Scraper

July 15, 2013

I just finished cleaning up a utility I built a few years ago and pushed it up to GitHub. I often found myself needing an HTML export of a site, either so that design vendors could update the CSS/JS or because the site was being moved to a different system in another part of the company. I've cleaned up the design so that it's less distracting and updated how it works so that you get what you want in the end: a clickable website in a folder.

Here's a screenshot:

[screenshot: the Site Scraper utility]

There are two parts that I built: the Xenu list processor and the link processor. I use the Xenu site crawler to first get a list of all the CSS, JS, PDF, image and page URLs on the site, then export that list to a tab-separated file. The first form strips all the extra information out of the Xenu link list and creates a new file that can be loaded into the drop-down list of the second form. There are also a few options to work with, such as include/exclude filtering and splitting the links for files, images and pages into separate files; sometimes it's easier to process long lists of files separately. I found it useful, so it's in there.
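The actual tool is a desktop app, but the first step boils down to something like this minimal Python sketch (not the real code behind the form; the assumption here is that the URL is the first field of Xenu's tab-separated export, and the file-type buckets and function name are just for illustration):

    import csv

    def clean_xenu_export(path, include=None, exclude=None):
        """Strip a Xenu tab-separated export down to plain URL lists."""
        pages, images, files = [], [], []
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            for row in csv.reader(f, delimiter="\t"):
                # Skip the header row and anything that isn't a URL.
                if not row or not row[0].startswith("http"):
                    continue
                url = row[0].strip()
                # Include/exclude filtering on a simple substring match.
                if include and include not in url:
                    continue
                if exclude and exclude in url:
                    continue
                lower = url.lower()
                if lower.endswith((".jpg", ".jpeg", ".png", ".gif")):
                    images.append(url)
                elif lower.endswith((".css", ".js", ".pdf")):
                    files.append(url)
                else:
                    pages.append(url)
        return pages, images, files

The pages list is what ends up in the new file that the second form's drop-down points at; the images and files lists only get written out separately if you turn that option on.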

The second form takes the list generated by the first form, or really any list of links you provide, requests the content of each page and saves it to a file. The saved files are foldered to match the existing site, and all the internal site links are replaced with relative paths so that when you open the files in a browser you can click through to other pages. You do of course have options: you can opt not to rewrite the URLs to relative paths, save the pages as a different file type, set what the default page name will be, and append the querystring to the filename so you can capture separate states of an individual page if you need to.
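In rough terms, each link goes through something like the sketch below (again Python rather than the real implementation; the function name, the index.html default and the naive string replacement of absolute URLs are all assumptions made just to show the idea):

    import os
    import re
    from urllib.parse import urlsplit
    from urllib.request import urlopen

    def save_page(url, out_root, default_name="index.html", keep_query=False):
        # Mirror the site's folder structure under out_root.
        parts = urlsplit(url)
        path = parts.path.lstrip("/") or default_name
        if path.endswith("/"):
            path += default_name
        if keep_query and parts.query:
            # Append the querystring so separate states of a page get separate files.
            path += "_" + re.sub(r"[^A-Za-z0-9]+", "_", parts.query)
        local = os.path.join(out_root, path)
        os.makedirs(os.path.dirname(local) or out_root, exist_ok=True)

        html = urlopen(url).read().decode("utf-8", errors="replace")

        # Rewrite absolute internal links to relative paths so the saved
        # pages stay clickable in a browser (crude whole-string replace,
        # not real href/src parsing).
        site_root = parts.scheme + "://" + parts.netloc + "/"
        prefix = "../" * path.count("/")
        html = html.replace(site_root, prefix if prefix else "./")

        with open(local, "w", encoding="utf-8") as f:
            f.write(html)
        return local

Skipping the rewrite step is just a matter of not doing that replace, which is what the "don't update URLs" option amounts to.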

I've found it incredibly useful, and hopefully other people will too.