
A Guide for Archiving Web Pages


Detection and downloading of pages as they change over time

Web archiving involves the detection, capture, and storage of all types of web content, including HTML pages, style sheets, JavaScript, images, and video. If metadata for the pages is kept in separate files rather than embedded in the pages themselves, those files are captured as well. Web archivists also create and store their own metadata about the collected resources, including the time and date of collection, the extent of the pages collected, and the encoding protocols used for transmission. This metadata helps establish the authenticity and provenance of the archived collection.
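
As a concrete illustration only (not drawn from any particular archiving system), a capture record of the kind described above might look like the following Python sketch. The field names are hypothetical; real archives record this information in established formats such as WARC record headers or Dublin Core.

```python
from datetime import datetime, timezone

# Hypothetical metadata record for one captured resource; the field names are
# illustrative only and are not taken from WARC, Dublin Core, or any standard.
capture_record = {
    "url": "http://www.example.org/index.html",               # resource that was captured
    "capture_time": datetime.now(timezone.utc).isoformat(),   # time and date of collection
    "extent": {"pages": 42, "bytes": 1572864},                # extent of the pages collected
    "transfer_encoding": "chunked",                           # encoding protocol used for transmission
    "content_type": "text/html; charset=utf-8",
}

print(capture_record)
```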

There are tools that can be used to both detect and download pages as they change over time.

To detect changes in pages without downloading them, see Monitoring changes to web pages, an annotated list of detection tools from Rhodes-Blakeman Associates (2008). Note that this list does not cover tools that work within specific browsers, such as the Firefox Update Scanner extension.
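
For readers who prefer to script such checks themselves, the general idea behind these detection tools can be sketched in a few lines of Python: fetch the page, hash its contents, and compare the hash with the one saved from the previous visit. The URL and state file below are placeholders.

```python
import hashlib
import urllib.request
from pathlib import Path

URL = "http://www.example.org/"   # placeholder: page to monitor
STATE = Path("last_hash.txt")     # placeholder: file holding the previous hash

def page_changed(url: str, state_file: Path) -> bool:
    """Return True if the page's content hash differs from the stored one."""
    with urllib.request.urlopen(url) as response:
        digest = hashlib.sha256(response.read()).hexdigest()
    previous = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(digest)
    return digest != previous

if __name__ == "__main__":
    print("changed" if page_changed(URL, STATE) else "unchanged")
```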

For downloading changed pages, see the short, annotated list of web harvesting tools in the Preservation of Web Resources Handbook (PDF) from the University of London Computing Centre, pp. 23-27 (2008). The Harvesting Software subsection of the Tools section of Harvard University's Web Archiving Resources has an extensive list of tools with annotations and links to sources (2008). The National Archives and Records Administration (USA) has a similar annotated Resource List of harvesting tools (2005).

The most highly regarded harvesting tool for general use is the HTTrack Website Copier. According to its homepage, HTTrack "allows you to download a World Wide Web site from the Internet to a local directory, capturing HTML, images, and other files from the server, and recursively building all directories locally. It can arrange the original site's relative link-structure so that the entire site can be viewed locally as if online. It can also update an existing mirrored site, and resume interrupted downloads." Like many crawlers, however, HTTrack may have trouble capturing some parts of websites, particularly those that rely on Flash, Java, JavaScript, or complex CGI.
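
For illustration only, a basic HTTrack run can also be driven from a script. The sketch below assumes the httrack command-line client is installed and on the PATH; the URL, output directory, and filter are placeholders, and HTTrack's own documentation should be consulted for the full option list.

```python
import subprocess

# Illustrative sketch only: invoke the HTTrack command-line client to mirror a
# site. Assumes `httrack` is installed and on the PATH; the URL, output path,
# and filter below are placeholders. See `httrack --help` for the full options.
subprocess.run(
    [
        "httrack",
        "http://www.example.org/",      # site to copy
        "-O", "/var/archives/example",  # local directory for the mirror
        "+*.example.org/*",             # filter: stay within this host
        "-v",                           # verbose progress output
    ],
    check=True,
)
```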

More useful than HTTrack, however, is the set of integrated harvesting tools called the Web Curator Tool (WCT). This highly regarded and widely used project is an open-source, free-of-charge toolset "for managing the selective web harvesting process. It is designed for use in libraries and other collecting organisations, and supports collection by non-technical users while still allowing complete control of the web harvesting process."

WCT can capture whole web sites as well as individual web pages. It harvests a broad range of file types, including HTML pages, images, PDF and Word documents, and multimedia content such as audio and video files. It supports selecting pages, requesting permission to harvest where needed, describing pages, determining scope and boundaries, scheduling harvests, running the harvests themselves, performing quality reviews, and depositing harvested pages in a digital repository or archive.
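
WCT itself is operated through a web interface rather than code, but the workflow it manages can be summarised schematically. The sketch below is purely illustrative, restating the steps listed above as a Python enumeration; it does not use any actual WCT API.

```python
from enum import Enum, auto

class HarvestStage(Enum):
    """Illustrative stages of a selective harvesting workflow; not an actual
    Web Curator Tool API."""
    SELECTION = auto()        # choose the target site or pages
    PERMISSION = auto()       # request permission to harvest, if needed
    DESCRIPTION = auto()      # describe the target for the archive's records
    SCOPING = auto()          # set scope, boundaries, and the harvest schedule
    HARVEST = auto()          # run the crawl itself
    QUALITY_REVIEW = auto()   # review the result for completeness
    DEPOSIT = auto()          # deposit the harvest in a repository or archive

# A harvest target moves through the stages in order:
for stage in HarvestStage:
    print(stage.name.replace("_", " ").lower())
```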

Some resources on detection and downloading of web pages
