Eric Celeste
Associate University Librarian for Information Technology
University of Minnesota
When looking for ways to document Internet2 as an organization for the NHPRC-funded Documenting Internet2 project, we soon determined that a web crawl would provide one very helpful pool of information. This is the story of how we accomplished such a crawl using Heritrix, the Java-based crawler from the Internet Archive. Combining Heritrix with a dash of Perl, JavaScript, PHP, and MySQL, we created both an online snapshot of Internet2 (I2) and a searchable representation on a local file system. Very much an experiment, the crawl revealed a number of challenges in capturing web sites for archival purposes.