Documenting Internet2: Using Heritrix for Focussed Web Crawl

Eric Celeste
Associate University Librarian for Information Technology
University of Minnesota

When looking for ways to document Internet2 as an organization, as part of the NHPRC-funded Documenting Internet2 project, we soon determined that a web crawl would provide a very helpful pool of information.  This is the story of how we accomplished such a crawl using the Heritrix crawler from the Internet Archive.  Combining Heritrix (a Java-based crawler) with a dash of Perl, JavaScript, PHP, and MySQL, we created both an online snapshot of I2 and a searchable representation on a local file system.  This was very much an experimental crawl, and the process revealed a number of challenges in capturing web sites for archival purposes.

http://wiki.lib.umn.edu/DI2/HomePage

PowerPoint Presentation

Last updated:  Monday, April 29th, 2013