A Guide for Archiving Web Pages

This guide for archiving web pages is a component of the Guide for Organizing, Cataloging, and Preserving Collections of Papers, Photographs, and Other Records. It is an introduction to a complex and constantly evolving subject for those who are responsible for relatively small numbers of closely related web pages. Neither comprehensive nor professionally vetted, it is a brief overview by an interested amateur and is intended as a starting point for research on the topic.

What is web archiving?

Web archiving aims to preserve selected sets of internet pages. Because the pages themselves are inherently impermanent -- subject to degradation, obsolescence, malicious destruction, and inadvertent alteration or deletion -- they are preserved by replication and by migration from outmoded formats to up-to-date ones. The replicated pages are kept in their entirety -- with accompanying format files, graphic images, and the like -- and are maintained on preservation servers in safe environments. Because no server or server environment can be made fully safe, and because server storage is not costly, mirror sites are often maintained in widely separated geographic locations.
-- more on web archiving basics --

How is web archiving done?

Web archiving includes practices for the creation and maintenance of pages to be archived, an initial archiving snapshot of the entire web site, detection and downloading of pages as they change over time, and maintenance of the servers on which pages are saved, including ensuring that multiple copies are kept.

Creation and maintenance of pages to be archived

What are the best practices for pages to be archived? Strictly speaking, web archiving practices should be able to cope with any valid code for internet pages. In practice, however, it is highly desirable for pages to observe standard practices for accessibility, for separation of content and presentation, for meta tags that consistently describe contents, and for ensuring content integrity. Formats and contents that hinder access by people with disabilities should be avoided. The XHTML format is preferred to HTML. Formatting via linked style sheets is preferable to formatting by tables or other outmoded means. A full set of preservation meta tags should be present on each page. Written policies should set out how content integrity is assured.
-- more on best practices for pages to be archived --
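
The requirement that each page carry a full set of meta tags can be checked automatically. The following is a minimal sketch in Python using only the standard library. The names in REQUIRED_META are hypothetical placeholders, since this guide does not prescribe a particular set, and the function and class names are illustrative.

```python
from html.parser import HTMLParser

# Hypothetical set of required meta tag names; substitute your own policy.
REQUIRED_META = {"description", "author", "date"}


class MetaCollector(HTMLParser):
    """Collect the name of every <meta> tag that has non-empty content."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name and attributes.get("content"):
            # Meta names are compared case-insensitively here.
            self.found.add(name.lower())


def missing_meta(html_text, required=REQUIRED_META):
    """Return the required meta tag names absent from the page, sorted."""
    collector = MetaCollector()
    collector.feed(html_text)
    return sorted(set(required) - collector.found)
```

Running missing_meta over every page before a snapshot is taken gives a simple list of pages that still need attention.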

Content management systems offer a relatively easy and effective means for creating and maintaining web sites and the pages they contain. Because they foster consistency and adherence to standards, they have especial value where the pages are to be archived.
-- more on content management systems --

Archiving snapshot of the entire web site

A snapshot is the starting point for an ongoing web archiving program. Either manually or using a snapshot tool, the manager of a web site copies the entire site, preserving its organization and all contents, and transmits the resulting files to the web archive servers. A snapshot is thus simply a faithful replication of the site as it existed at a particular point in time.
-- more on snapshot tools --
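
For a site whose files are available on a local disk, a snapshot can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a substitute for a dedicated snapshot tool: the function names, the folder naming scheme, and the manifest.json layout are all invented for the example, and the archive folder is assumed to lie outside the site folder.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def take_snapshot(site_dir, archive_dir):
    """Copy site_dir into a new timestamped folder under archive_dir."""
    site_dir = Path(site_dir)
    taken_at = datetime.now(timezone.utc)
    target = Path(archive_dir) / taken_at.strftime("snapshot-%Y%m%dT%H%M%SZ")
    # copytree preserves the directory layout and fails if the target exists.
    shutil.copytree(site_dir, target / "site")
    # The manifest records when the copy was made and a checksum per file.
    manifest = {
        "taken_at": taken_at.isoformat(),
        "files": {
            p.relative_to(site_dir).as_posix(): file_sha256(p)
            for p in sorted(site_dir.rglob("*"))
            if p.is_file()
        },
    }
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return target
```

The manifest is what makes the copy verifiable later: the checksums can be recomputed at any time to confirm that the snapshot has not been altered.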

Detection and downloading of pages as they change over time

Web archiving involves the detection, capture, and storage of all types of web content including HTML web pages, style sheets, JavaScript, images, and video. If meta tags for the pages are kept in separate files rather than in the pages themselves, those files are included as well. Web archivists also create and store metadata about the collected resources including time and date of collection, extent of pages collected, and encoding protocols used for transmission. This metadata helps assure the authenticity and provenance of the archived collection.
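
The collection metadata described above can be pictured as a small record stored alongside each captured resource. The sketch below is one possible shape for such a record in Python. The field names are illustrative rather than drawn from any standard, and the added checksum field is an assumption of this example, included because it supports later integrity checks.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class CaptureRecord:
    """Metadata for one captured resource (field names are illustrative)."""
    url: str
    collected_at: str      # time and date of collection, ISO 8601, UTC
    size_bytes: int        # extent of the captured resource
    content_type: str      # media type reported by the server
    content_encoding: str  # content encoding reported by the server
    sha256: str            # checksum of the captured bytes


def describe_capture(url, body, headers, collected_at=None):
    """Build a CaptureRecord from the raw bytes and response headers."""
    when = collected_at or datetime.now(timezone.utc)
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    return CaptureRecord(
        url=url,
        collected_at=when.isoformat(),
        size_bytes=len(body),
        content_type=lowered.get("content-type", "unknown"),
        content_encoding=lowered.get("content-encoding", "identity"),
        sha256=hashlib.sha256(body).hexdigest(),
    )
```

A record can be written out with json.dumps(asdict(record)) and kept next to the captured file, so that the evidence of when and how a page was collected travels with the page itself.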

Although the pages to be archived can be manually saved and mirrored, it is best if the web archiving process is automated as much as can be managed. The most common form of automated archiving is a process called web harvesting. Web harvesting software includes crawlers, intended for large-scale operations such as the Internet Archive and national libraries, and curatorial tools for use by archives and other relatively smaller organizations. There are curatorial tools to both detect and download pages as they change over time.
-- more on web archiving tools --
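
The core of change detection is a comparison between what was captured last time and what is on the site now. The sketch below shows that comparison in Python using checksums. It is a simplified illustration, not a description of how any particular harvesting tool works, and both function names are invented for the example.

```python
import hashlib


def checksum_pages(pages):
    """Map each page path to the SHA-256 digest of its bytes."""
    return {path: hashlib.sha256(body).hexdigest() for path, body in pages.items()}


def detect_changes(previous, current):
    """Compare two {path: checksum} mappings and classify each path."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = sorted(
        path
        for path in set(previous) & set(current)
        if previous[path] != current[path]
    )
    return {"added": added, "changed": changed, "removed": removed}
```

Only the pages listed under "added" and "changed" need to be downloaded again. Pages listed under "removed" are kept in the archive, since the purpose is to preserve what once existed.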

Maintaining the servers on which pages are saved and ensuring that multiple copies are saved

Servers are both hardware and software. The term encompasses the computers on which datafiles are stored, the operating systems of those computers, and the software for interacting with the datafiles. Server maintenance is mostly a matter of common sense. Both the hardware and the software of the server environment need to be maintained. There should be a maintenance policy that calls for regular surveys to assure that everything is working properly. Because the datafiles and the magnetic media that house them are inherently unstable, web archives should be mirrored to at least one other site. Mirroring software can be used for automatic replication of web archives.
-- more on server maintenance and mirroring --
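
A regular survey can include a check that the mirror still matches the primary archive. The following Python sketch assumes both copies are reachable as local or mounted folders, which will not be true of every setup. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_checksums(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def mirror_discrepancies(primary_dir, mirror_dir):
    """List files that are missing from, or differ on, the mirror."""
    primary = tree_checksums(primary_dir)
    mirror = tree_checksums(mirror_dir)
    return sorted(
        path for path, digest in primary.items() if mirror.get(path) != digest
    )
```

An empty list means every file in the primary archive has an identical copy on the mirror. The check does not report extra files that exist only on the mirror.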

Web archiving does not include the practice of backing up the code of web pages as they are created and updated, but it assumes that this backing up takes place. Preferably there should be two or more backup copies of the entire website as it is made and altered. The backups should not all be on the same computer; at least one should be on a portable hard drive, backup tape cartridge, network-accessible server, or set of CD-ROM discs, and these external locations should be maintained in accordance with archival best practices.
-- more on backing up practices --

How do we keep track of our archiving?

A web archiving program should include documentation of policies, guidelines, and standards.

Policies, procedures, guidelines, and standards can be formal or informal, published on a web site or kept in a notebook, maintained in draft or formally disseminated, but whatever form they take, they should be prepared. As with general archival policy statements, it is probably best to view them as works in progress, to be updated as time allows and circumstances dictate.

They should address all aspects of web archiving including practices for the creation, maintenance, and backing up of pages; policies and workflows for detection and downloading of pages as they change over time; and guidelines for maintaining the servers on which pages are saved and ensuring that multiple copies are saved on mirror sites.
-- more on policies, procedures, and guidelines --
