CNI: Coalition for Networked Information

Creating Topical Collections: Web Archives vs. the Live Web

December 10, 2017

Martin Klein
Scientist, Research Library
Los Alamos National Laboratory

Collections of web pages about significant events, such as natural disasters or terror attacks, have gained importance over the last few years. Academic digital libraries and special collections departments, as well as historians and social scientists, apply various techniques to collect resources of interest. A common approach is to deploy a focused web crawler that targets pages highly relevant to the event. Because the web is dynamic and fast-paced, however, the timing of such crawls is critical: if too much time has elapsed since the event, the live web may no longer be the best source. We hypothesize that a web collection about an unplanned event (e.g., a terror attack) that is created some time after the event is better built by a focused crawl of web archives than by a crawl of the live web. Given today’s landscape of web archiving institutions, and the Memento protocol for accessing their holdings simultaneously, we are able to create highly relevant web collections of past events. In this talk I will present preliminary results of our collaborative study (with Herbert Van de Sompel and Lyudmilla Balakireva) comparing focused crawls of web archives with focused crawls of the live web. I will detail our methodology, show the precision of web archive crawls, showcase the benefits of using multiple web archives at once, and weigh our findings against cost factors such as crawl time. Our results aim to support anyone interested in creating high-quality topical web collections or refining existing broad archival collections, and they make a strong argument for the merit of using multiple web archives.
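
As a rough illustration of how a crawler might address several web archives at once through Memento, the sketch below builds (but does not send) a TimeGate request for a page as it existed near a given date. It is not the study's implementation. The aggregator endpoint, the function name `build_timegate_request`, and the example URL are assumptions made for illustration; only the `Accept-Datetime` request header comes from the Memento protocol itself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumption: a Memento aggregator TimeGate endpoint; substitute the one you use.
AGGREGATOR_TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(original_url: str, when: datetime) -> Request:
    """Build (without sending) a TimeGate request for original_url near `when`.

    Hypothetical helper: a focused crawler could issue one such request per
    candidate URL and then fetch the archived copy the TimeGate points to.
    """
    # Treat naive datetimes as UTC, then normalise to UTC for the header.
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    when_utc = when.astimezone(timezone.utc)

    # Memento datetime negotiation uses an HTTP-date in GMT.
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}

    # HEAD is enough to read the redirect and Link headers of the response.
    return Request(AGGREGATOR_TIMEGATE + original_url, headers=headers, method="HEAD")


if __name__ == "__main__":
    req = build_timegate_request(
        "http://example.com/", datetime(2017, 12, 10, 12, 0, tzinfo=timezone.utc)
    )
    print(req.get_method(), req.full_url)
    print(req.header_items())
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup. The sketch leaves that step out so that it runs without network access.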

Presentation

Filed Under: Assessment, CNI Fall 2017 Project Briefings, Information Access & Retrieval, Project Briefing Pages
Tagged With: cni2017fall, Project Briefings & Plenary Sessions

Last updated: Monday, December 18th, 2017

 
