Software and Processes
The Digital Projects Unit's web archiving techniques.
- Archiving Process
- Current Challenges and Limitations
- Crawl Configuration
- Crawl Scope
Archiving Process
There are two main pieces to our web archiving process: harvesting the live web content and providing access to the resulting archived files.
We use Wayback, another piece of open source software from the Internet Archive, to display the archived content on the Web. This application relies on client-side and server-side scripts to rewrite links, so that requests are served from documents in the archive's WARC files rather than from the live Web.
We begin by examining the site(s) that we will be archiving, looking for areas we do or do not want to capture and identifying potential crawler traps. We browse the site manually as a user might and also look at source code when crawler access appears questionable. If necessary, we write scripts to extract elusive URIs that may then be added to the crawl's seed list.
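As an illustration of the kind of script mentioned above, here is a minimal sketch in Python. It is not the unit's actual tooling: the function name `extract_candidate_seeds`, the attribute list, and the URL pattern are assumptions chosen for the example. It collects link targets from HTML attributes and also picks up absolute URLs that appear only inside inline scripts, where a crawler's link extractor may miss them.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches absolute http(s) URLs anywhere in the page source, including
# inside inline JavaScript. The pattern is deliberately simple.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\\)]+""")


class LinkCollector(HTMLParser):
    """Collects href/src style attribute values as absolute URIs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Attribute names here are an assumed, non-exhaustive list.
            if name in ("href", "src", "data-src") and value:
                self.found.add(urljoin(self.base_url, value))


def extract_candidate_seeds(html_text, base_url):
    """Return a sorted list of URIs that could be added to a seed list."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.found.update(URL_PATTERN.findall(html_text))
    return sorted(collector.found)
```

The output would still need manual review before any URI is added to the seed list, since a pattern this loose can also match URIs that are out of scope.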
With knowledge gained from site examination, we configure our crawler with rules that instruct it to harvest the content we have deemed to be within scope.
Next we do a test crawl to verify that our crawler has been configured in a way that allows us to download all of the URIs needed to render the archived site true to its live version. Completing a test capture also indicates how much time it should take to execute the final crawl, which depends on factors such as the amount of content, how it is organized, and any delays needed to keep from overwhelming the target server.
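A rough way to turn a test capture into a time estimate is simple extrapolation from the observed retrieval rate. The sketch below assumes the final crawl proceeds at the same average rate as the test crawl, which will not hold exactly if the remaining content is organized differently or politeness delays change. The function name and the figures in the usage note are hypothetical.

```python
def estimate_crawl_seconds(test_uris, test_seconds, expected_uris):
    """Extrapolate the duration of a full crawl from a test crawl.

    test_uris:     number of URIs fetched during the test crawl
    test_seconds:  wall-clock duration of the test crawl
    expected_uris: estimated number of URIs in the full crawl
    """
    if test_uris <= 0:
        raise ValueError("test crawl must have fetched at least one URI")
    # Same average rate assumed for the full crawl.
    return expected_uris * test_seconds / test_uris
```

For example, if a test crawl fetched 500 URIs in 600 seconds and the site is believed to hold about 20,000 URIs, the estimate is 24,000 seconds, or a little under seven hours.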
Once the crawl is complete, we create a CDX index for the downloaded items stored in the WARC files. When we configure Wayback to use this index and a second index that maps the WARC file locations, the captured web site is viewable and can be checked for quality. We do this check by running Wayback in Proxy Replay mode, where Wayback serves as an HTTP proxy server, and using tools such as HttpFox and Live HTTP Headers to facilitate discovery of missed content.
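To make the role of the CDX index more concrete, the following sketch reads a small CDX file and reports where the captures of a given URL are stored. It assumes a space-delimited CDX file whose header line names its columns with the conventional single-letter codes (`a` for the original URL, `b` for the capture date, `g` for the file name, `V` for the compressed offset). The function names are hypothetical, and the sample lines in the usage note are invented for illustration, not taken from a real crawl.

```python
def parse_cdx(lines):
    """Parse CDX lines into dictionaries keyed by the header's field codes."""
    lines = iter(lines)
    header = next(lines).split()
    if not header or header[0] != "CDX":
        raise ValueError("expected a header line beginning with ' CDX'")
    fields = header[1:]
    records = []
    for line in lines:
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed or blank lines
        records.append(dict(zip(fields, values)))
    return records


def locate_captures(records, original_url):
    """Return (capture date, file name, offset) for each capture of a URL."""
    return [
        (r["b"], r["g"], int(r["V"]))
        for r in records
        if r.get("a") == original_url
    ]
```

Given the invented lines below, looking up `http://example.org/about` returns one capture at offset 1234 of `sample-0001.warc.gz`:

```python
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20100101000000 http://example.org/ text/html 200 AAAA - - 1234 0 sample-0001.warc.gz",
    "org,example)/about 20100101000005 http://example.org/about text/html 200 BBBB - - 987 1234 sample-0001.warc.gz",
]
print(locate_captures(parse_cdx(SAMPLE), "http://example.org/about"))
```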
If we missed desired areas of content, we modify our crawl configuration and execute another crawl to obtain the documents we lack.
Downloaded content is stored in WARC files. We do not manipulate these files once they are written, allowing us to keep a true record of a site at the time of its capture.
Current Challenges and Limitations
Although an active community is developing and improving the tools and methods used in web archiving, a common set of problems continues to arise during the process.
External links and externally hosted media, such as video, may be problematic to harvest since we must rely on third parties to supply the files. Even when we are able to download the media content files, some embedded media players do not function properly when replaying a site. Sites that embed media but also provide a link to a direct download of the media file help to ensure that users will be able to access these files from the archive.
Crawl Configuration
When configuring a crawler for a harvest, we consider settings including:
- How many threads (processes) should be run at once
- How long the crawler should wait before retrievals
- How many times URIs should be retried
- Whether or not the crawler should comply with robots.txt
- In what format downloaded content should be written
The settings we apply change from crawl to crawl based on factors such as the resources and hardware we currently have available, the permission we have obtained to harvest a site, any time limitations in place, and our goals for a specific capture. Two settings that remain consistent for every crawl specify our crawl operator information. Our crawler informs the web servers it visits of a URL that a webmaster who notices our traffic can visit to read about our crawling activity. Additionally, we provide an e-mail address, so that a webmaster who finds our crawler causing trouble for their servers, such as by making too many requests too quickly, can contact us about the issue.
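One common way for a crawler to pass operator information to a web server is through HTTP request headers. The sketch below shows the general idea using Python's standard library; it is not our crawler's configuration. The crawler name, URL, and e-mail address are placeholders, and the exact header layout is an assumption for illustration.

```python
from urllib.request import Request


def build_operator_headers(crawler_name, info_url, contact_email):
    """Build request headers that identify the crawl operator.

    The User-Agent carries a URL describing the crawl, and the
    From header carries a contact e-mail address.
    """
    return {
        "User-Agent": f"Mozilla/5.0 (compatible; {crawler_name} +{info_url})",
        "From": contact_email,
    }


# Placeholder values; a real crawl would use the operator's own details.
headers = build_operator_headers(
    "example-crawler/1.0",
    "http://crawl.example.org/about",
    "webarchive@example.org",
)
# Constructing the request does not send anything over the network.
request = Request("http://example.org/", headers=headers)
```

A webmaster reading server logs would then see the information URL in the user agent string of every request.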
Crawl Scope
During configuration, we also define scope rules for the crawler to follow. Some of our most commonly applied rules are:
- Accept URIs based on SURT prefix
- Accept and reject URIs based on regular expressions
- Reject URIs based on too many path segments (potential crawler trap)
- Accept URIs based on number of hops from seed
- Use a transclusion rule that accepts embedded content hosted by otherwise out-of-scope domains
- Accept a URI based on an in-scope page linking to it
- Use a prerequisite rule that accepts otherwise out-of-scope URIs that are required to get something that is in scope
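The first four rules in the list above can be sketched as a small decision function. This is a simplified illustration, not the crawler's actual rule engine: the `ScopeRules` class, the `to_surt` helper, and the thresholds are assumptions, and the SURT conversion ignores ports, credentials, and query strings. In SURT form the host name is reversed, so `http://www.example.org/docs/` becomes `http://(org,example,www,)/docs/`, and a prefix such as `http://(org,example,` matches the domain and all of its subdomains.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


def to_surt(uri):
    """Simplified SURT form: scheme, reversed host labels, then the path."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    reversed_host = ",".join(reversed(host.split("."))) + ","
    return f"{parts.scheme}://({reversed_host}){parts.path}"


@dataclass
class ScopeRules:
    surt_prefixes: list          # accept URIs whose SURT form starts with one
    reject_patterns: list        # compiled regular expressions to reject
    max_path_segments: int = 20  # guard against crawler traps
    max_hops: int = 10           # maximum link hops from a seed


def in_scope(uri, hops_from_seed, rules):
    """Return True if the URI passes every rule in this sketch."""
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    if len(segments) > rules.max_path_segments:
        return False  # too many path segments, possible crawler trap
    if any(p.search(uri) for p in rules.reject_patterns):
        return False  # explicitly rejected by regular expression
    if hops_from_seed > rules.max_hops:
        return False  # too far from any seed
    surt = to_surt(uri)
    return any(surt.startswith(prefix) for prefix in rules.surt_prefixes)
```

The transclusion, link-based, and prerequisite rules are left out because they depend on how a URI was discovered, which requires crawl state that this sketch does not model.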