Site Scraper

July 15, 2013

I just finished cleaning up a utility I built a few years ago and pushed it up to GitHub. I often found myself needing an HTML export of a site, either so that design vendors could update the CSS/JS or because the site was being moved to a different system in another part of the company. I've cleaned up the design so that it's less distracting and updated how it works so that you get what you want in the end: a clickable website in a folder.

Here's a screenshot:

[screenshot: the Site Scraper utility]

There are two parts that I built: the Xenu list processor and the link processor. I use the Xenu site crawler to first get a list of all the CSS, JS, PDF, image and page URLs on the site, then export that list to a tab-separated file. The first form strips all the extra information out of the Xenu link list and creates a new file that can be loaded into the drop-down list of the second form. There are also a few options to work with, such as include/exclude filtering and splitting the links for files, images and pages into separate files; sometimes it's easier to process long lists of files separately. I found it useful, so it's in there.
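The actual tool is a desktop app, but the first step boils down to something like this minimal Python sketch (not the real code behind the form; the assumption here is that the URL is the first field of Xenu's tab-separated export, and the file-type buckets and function name are just for illustration):

    import csv

    def clean_xenu_export(path, include=None, exclude=None):
        """Strip a Xenu tab-separated export down to plain URL lists."""
        pages, images, files = [], [], []
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            for row in csv.reader(f, delimiter="\t"):
                # Skip the header row and anything that isn't a URL.
                if not row or not row[0].startswith("http"):
                    continue
                url = row[0].strip()
                # Include/exclude filtering on a simple substring match.
                if include and include not in url:
                    continue
                if exclude and exclude in url:
                    continue
                lower = url.lower()
                if lower.endswith((".jpg", ".jpeg", ".png", ".gif")):
                    images.append(url)
                elif lower.endswith((".css", ".js", ".pdf")):
                    files.append(url)
                else:
                    pages.append(url)
        return pages, images, files

The pages list is what ends up in the new file that the second form's drop-down points at; the images and files lists only get written out separately if you turn that option on.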

The second form takes the list generated by the first form, or really any list of links you provide, requests the content of each page and saves it to a file. The saved files are foldered to match the existing site, and all the internal site links are replaced with relative paths so that when you open the files in a browser you can click through to other pages. You do of course have options: you can opt not to rewrite the URLs to relative paths, save the pages as a different file type, set what the default page name will be, and append the querystring to the filename so you can capture separate states of an individual page if you need to.
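In rough terms, each link goes through something like the sketch below (again Python rather than the real implementation; the function name, the index.html default and the naive string replacement of absolute URLs are all assumptions made just to show the idea):

    import os
    import re
    from urllib.parse import urlsplit
    from urllib.request import urlopen

    def save_page(url, out_root, default_name="index.html", keep_query=False):
        # Mirror the site's folder structure under out_root.
        parts = urlsplit(url)
        path = parts.path.lstrip("/") or default_name
        if path.endswith("/"):
            path += default_name
        if keep_query and parts.query:
            # Append the querystring so separate states of a page get separate files.
            path += "_" + re.sub(r"[^A-Za-z0-9]+", "_", parts.query)
        local = os.path.join(out_root, path)
        os.makedirs(os.path.dirname(local) or out_root, exist_ok=True)

        html = urlopen(url).read().decode("utf-8", errors="replace")

        # Rewrite absolute internal links to relative paths so the saved
        # pages stay clickable in a browser (crude whole-string replace,
        # not real href/src parsing).
        site_root = parts.scheme + "://" + parts.netloc + "/"
        prefix = "../" * path.count("/")
        html = html.replace(site_root, prefix if prefix else "./")

        with open(local, "w", encoding="utf-8") as f:
            f.write(html)
        return local

Skipping the rewrite step is just a matter of not doing that replace, which is what the "don't update URLs" option amounts to.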

I've found it incredibly useful, and hopefully other people will too.