Searching Across Time: Issues and Opportunities for Full Text Search of Web Archives Using Nutch

Kris Carpenter Negulescu
Director, Web Archive
Internet Archive

For the past two years, the Internet Archive (IA) has used the open source Nutch/Lucene tools to generate full text search indexes of archival Web content for the National Library of Australia, the Bibliothèque Nationale de France, the Library of Congress, and the National Archives and Records Administration (NARA). The IA has also produced an experimental search service for consumers of its own historic Web collections, The 20th Century Find, which encompasses content harvested from 1996 to 1999.

This presentation will review each of these case studies, including lessons learned to date from searching Web archives at a scale of 100 million to more than 1 billion URLs, and the challenges of searching across time, both technical and those specific to an end user’s experience of a Web archive. Although the case studies are specific to Nutch/Lucene implementations, the primary focus will be the implications for searching archives in general.

http://www.archive.org/
http://lucene.apache.org/nutch/

 

Last updated:  Friday, November 2nd, 2012