
A Guide for Archiving Web Pages


Detection and downloading of pages as they change over time

Web archiving involves the detection, capture, and storage of all types of web content, including HTML pages, style sheets, JavaScript, images, and video. If metadata for the pages is kept in separate files rather than embedded in the pages themselves, those files are captured as well. Web archivists also create and store their own metadata about the collected resources, including the time and date of collection, the extent of the pages collected, and the encoding protocols used for transmission. This metadata helps establish the authenticity and provenance of the archived collection.
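
As a concrete illustration only (not drawn from any particular archiving system), a capture record of the kind described above might look like the following Python sketch. The field names are hypothetical; real archives record this information in established formats such as WARC record headers or Dublin Core.

```python
from datetime import datetime, timezone

# Hypothetical metadata record for one captured resource; the field names are
# illustrative only and are not taken from WARC, Dublin Core, or any standard.
capture_record = {
    "url": "http://www.example.org/index.html",               # resource that was captured
    "capture_time": datetime.now(timezone.utc).isoformat(),   # time and date of collection
    "extent": {"pages": 42, "bytes": 1572864},                # extent of the pages collected
    "transfer_encoding": "chunked",                           # encoding protocol used for transmission
    "content_type": "text/html; charset=utf-8",
}

print(capture_record)
```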

There are tools that can be used to both detect and download pages as they change over time.

To detect changes in pages without downloading them, see Monitoring changes to web pages, an annotated list of detection tools from Rhodes-Blakeman Associates (2008). Note that this list does not cover tools that work within specific browsers, such as the Firefox Update Scanner extension.
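
For readers who prefer to script such checks themselves, the general idea behind these detection tools can be sketched in a few lines of Python: fetch the page, hash its contents, and compare the hash with the one saved from the previous visit. The URL and state file below are placeholders.

```python
import hashlib
import urllib.request
from pathlib import Path

URL = "http://www.example.org/"   # placeholder: page to monitor
STATE = Path("last_hash.txt")     # placeholder: file holding the previous hash

def page_changed(url: str, state_file: Path) -> bool:
    """Return True if the page's content hash differs from the stored one."""
    with urllib.request.urlopen(url) as response:
        digest = hashlib.sha256(response.read()).hexdigest()
    previous = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(digest)
    return digest != previous

if __name__ == "__main__":
    print("changed" if page_changed(URL, STATE) else "unchanged")
```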

For downloading changed pages, see the short, annotated list of web harvesting tools in the Preservation of Web Resources Handbook (PDF) from the University of London Computing Centre, pp. 23-27 (2008). The Harvesting Software subsection of the Tools section of Harvard University's Web Archiving Resources has an extensive list of tools with annotations and links to sources (2008). The National Archives and Records Administration (USA) has a similar annotated Resource List of harvesting tools (2005).

The most highly regarded harvesting tool for general use is the HTTrack Website Copier. According to its homepage, HTTrack "allows you to download a World Wide Web site from the Internet to a local directory, capturing HTML, images, and other files from the server, and recursively building all directories locally. It can arrange the original site's relative link-structure so that the entire site can be viewed locally as if online. It can also update an existing mirrored site, and resume interrupted downloads." Like many crawlers, however, HTTrack may have trouble capturing some parts of websites, particularly those that rely on Flash, Java, JavaScript, or complex CGI.
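
For illustration only, a basic HTTrack run can also be driven from a script. The sketch below assumes the httrack command-line client is installed and on the PATH; the URL, output directory, and filter are placeholders, and HTTrack's own documentation should be consulted for the full option list.

```python
import subprocess

# Illustrative sketch only: invoke the HTTrack command-line client to mirror a
# site. Assumes `httrack` is installed and on the PATH; the URL, output path,
# and filter below are placeholders. See `httrack --help` for the full options.
subprocess.run(
    [
        "httrack",
        "http://www.example.org/",      # site to copy
        "-O", "/var/archives/example",  # local directory for the mirror
        "+*.example.org/*",             # filter: stay within this host
        "-v",                           # verbose progress output
    ],
    check=True,
)
```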

More useful than HTTrack, however, is the set of integrated harvesting tools called the Web Curator Tool (WCT). This highly regarded and widely used project is an open-source, free-of-charge toolset "for managing the selective web harvesting process. It is designed for use in libraries and other collecting organisations, and supports collection by non-technical users while still allowing complete control of the web harvesting process."

WCT can capture whole web sites as well as individual web pages. It harvests a broad range of file types, including HTML pages, images, PDF and Word documents, and multimedia content such as audio and video files. It supports selecting pages, requesting permission to harvest where needed, describing pages, determining scope and boundaries, scheduling harvests, running the harvests themselves, performing quality reviews, and depositing harvested pages in a digital repository or archive.
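
WCT itself is operated through a web interface rather than code, but the workflow it manages can be summarised schematically. The sketch below is purely illustrative, restating the steps listed above as a Python enumeration; it does not use any actual WCT API.

```python
from enum import Enum, auto

class HarvestStage(Enum):
    """Illustrative stages of a selective harvesting workflow; not an actual
    Web Curator Tool API."""
    SELECTION = auto()        # choose the target site or pages
    PERMISSION = auto()       # request permission to harvest, if needed
    DESCRIPTION = auto()      # describe the target for the archive's records
    SCOPING = auto()          # set scope, boundaries, and the harvest schedule
    HARVEST = auto()          # run the crawl itself
    QUALITY_REVIEW = auto()   # review the result for completeness
    DEPOSIT = auto()          # deposit the harvest in a repository or archive

# A harvest target moves through the stages in order:
for stage in HarvestStage:
    print(stage.name.replace("_", " ").lower())
```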

Some resources on detection and downloading of web pages
