Searching Across Time: Issues and Opportunities for Full Text Search of Web Archives Using Nutch

Kris Carpenter Negulescu
Director, Web Archive
Internet Archive

For the past two years, the Internet Archive (IA) has used Nutch/Lucene open source tools to generate full text search indexes of archival Web content for the National Library of Australia, the Bibliothèque Nationale de France, the Library of Congress, and the National Archives and Records Administration (NARA). The IA has also produced an experimental search service for consumers of its own historic Web collections, The 20th Century Find, which encompasses content harvested from 1996-1999.

This presentation will review each of these case studies, including lessons learned to date from searching Web archives at a scale of 100 million to more than 1 billion URLs, and the challenges of searching across time, both the technical challenges and those specific to an end user’s experience of a Web archive. The case studies presented are specific to Nutch/Lucene implementations, but implications for searching archives in general will be the primary focus of this presentation.

http://www.archive.org/
http://lucene.apache.org/nutch/

 

Supporting Cyberscholarship in American Social History: The DLF Aquifer Story

Katherine Kott
Digital Library Federation Aquifer Director
Digital Library Federation

This session will provide a snapshot of the services now available through DLF Aquifer’s Andrew W. Mellon Foundation funded American Social History Online project, including a new Web-services portal that integrates with Zotero. The tool was designed with search engine optimization in mind and takes advantage of “asset actions” to represent distributed collections as if they were one. In addition to direct services to the scholar, Aquifer is committed to modeling distributed, network-based, collaborative solutions to share with the community. American Social History Online is being built by a geographically distributed team with open source software using an agile development method adapted from industry for the academic environment.

Attendees will learn about the processes in place for development, evaluation and assessment as well as how to contribute collections and why contributing collections makes sense. The session will conclude by describing the next phases of the project, including an open source federated search solution that will integrate American Social History Online content with licensed commercial content and a Sakai integration that will enable content to be annotated and saved, using Sakai tools.

http://www.diglib.org/aquifer/
http://wiki.dlib.indiana.edu/confluence/display/DLFAquifer/DLF+Aquifer+Wiki
http://sourceforge.net/projects/dlf-aquifer/
http://www.personalbee.com/1967
http://www.dlfaquifer.org

Handout (MS Word)

PowerPoint Presentation

 

A Survey and Evaluation of Open-Source Electronic Publishing Systems

Mark Cyzyk
Scholarly Communication Architect
The Johns Hopkins University

In late 2006 the Library Digital Programs Group of The Sheridan Libraries at Johns Hopkins University was commissioned by the Open Society Institute (OSI) to perform a survey and evaluation of open-source electronic publishing systems. This evaluation fits nicely with Hopkins’ interests in repository analysis and evaluation, long-term/large-scale data curation, and linkages between primary publications and related datasets, as well as with the interests of the Johns Hopkins University Press. This presentation will report on the methodology and results of the study, including an evaluation of DPubS, GNU Eprints, Hyperjournal, and Open Journal Systems, as well as commentary on Connexions/Rhaptos, DiVA, and Topaz.

The study itself will be published as a white paper by the Open Society Institute.

Handout (MS Word)

 

An Update from OAI Object Re-Use & Exchange

Herbert Van de Sompel
Digital Library Researcher
Los Alamos National Laboratory
Michael Lloyd Nelson
Assistant Professor
Old Dominion University
Carl Lagoze
Senior Researcher, Information Science Program
Cornell University
Robert Sanderson
Lecturer in Computer Science, UK National Centre for Text Mining
University of Liverpool

In September 2006, the Open Archives Initiative (OAI) launched the Object Re-Use and Exchange (ORE) effort, aimed at defining an interoperability framework that would allow leveraging the intrinsic value of scholarly digital objects beyond the borders of their hosting repositories, and turn these objects into the core of a Web 2.0 style scholarly communication flow. A major challenge in achieving these goals in a manner that fully leverages the Web Architecture relates to the compound nature of scholarly digital objects. Indeed, these objects typically are aggregations of multiple related resources that together form a logical intellectual whole. The Web architecture, however, only recognizes atomic resources, and does not natively support the aggregation concept.

As a result, a major focus of the international OAI-ORE Technical and Liaison Committees has been on defining a model for compound digital objects that is fully aligned with the Web architecture. The proposed ORE Model leverages concepts from the Semantic Web activities such as named graphs and linked data. It introduces the notion of a Resource Map that describes a finite set of resources (the resources in the aggregation), their types, intra-relationships, and relationships with resources external to this finite set.

Another focus has been on devising a serialization format for the ORE Model that is fully machine-readable and has a realistic chance of adoption. An Atom-compliant serialization – the Resource Map Profile of Atom – has been specified. In this serialization, an aggregation (e.g. a compound digital object) is represented as an Atom feed document. The ORE Model can also be serialized in RDF/XML, and a GRDDL-based approach to transform Atom feed documents into RDF/XML documents compliant with the ORE Model has been devised. Moreover, guidelines have been put forward to support discovery of Resource Maps.
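As a rough illustration of the idea that an aggregation can be represented as an Atom feed document, the following Python sketch models a Resource Map as a small data structure and writes it out as a plain Atom feed with one entry per aggregated resource. It is only a sketch under stated assumptions: the class names, the field names, and the example.org URIs are hypothetical, and the output is generic Atom, not the element layout defined by the Resource Map Profile of Atom.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

ATOM_NS = "http://www.w3.org/2005/Atom"  # standard Atom namespace


@dataclass
class AggregatedResource:
    """One resource in the aggregation (hypothetical structure)."""
    uri: str
    media_type: str
    title: str


@dataclass
class ResourceMap:
    """Describes a finite set of resources forming one compound object.

    This is an illustrative model only, not the ORE specification.
    """
    uri: str
    title: str
    resources: list = field(default_factory=list)

    def to_atom(self) -> str:
        """Serialize as a generic Atom feed: one entry per resource."""
        ET.register_namespace("", ATOM_NS)
        feed = ET.Element(f"{{{ATOM_NS}}}feed")
        ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = self.uri
        ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = self.title
        for res in self.resources:
            entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
            ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = res.uri
            ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = res.title
            ET.SubElement(entry, f"{{{ATOM_NS}}}link",
                          href=res.uri, type=res.media_type)
        return ET.tostring(feed, encoding="unicode")


# Hypothetical compound object: an article with a PDF and a dataset.
rem = ResourceMap(uri="http://example.org/rem/article-1",
                  title="Example article (compound object)")
rem.resources.append(AggregatedResource(
    "http://example.org/article-1.pdf", "application/pdf", "Full text"))
rem.resources.append(AggregatedResource(
    "http://example.org/article-1.csv", "text/csv", "Dataset"))
atom_xml = rem.to_atom()
print(atom_xml)
```

The feed stands for the aggregation as a whole, while each entry points at one atomic Web resource, which is how a compound object can be expressed without the Web architecture itself having to support aggregation.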

Around the time of this presentation, alpha versions of OAI-ORE specifications covering the aforementioned aspects will be published, and international feedback and experimentation will be encouraged. The insights gained from this feedback process will be taken into account for the publication of version 1 OAI-ORE specifications, which is planned for September 2008. This presentation will provide an overview of the ORE approach and will provide an opportunity to discuss the proposed direction with the presenters.

Open Archives Initiative Object Re-Use & Exchange is supported by The Andrew W. Mellon Foundation, with additional support from Microsoft Corporation and the National Science Foundation.

http://www.openarchives.org/ore

Handout (PDF)

 

Update on Key Copyright Developments in the U.S. with a Focus on 108 Study Group

James G. Neal
Vice President for Information Services and University Librarian
Columbia University

Copyright continues to be a core interest of the higher education and academic library communities. This briefing will focus on eight critical legislative and legal arenas where the U.S. will be working on copyright: orphan works, digital fair use, broadcast flag, section 1201 anti-circumvention rulemaking, electronic reserves, peer-to-peer file sharing, open access to government funded research, and the report of the section 108 study group on exceptions and limitations for libraries and archives. The work of the 108 study group will be highlighted, including its primary findings and recommendations. In addition, two important recent studies will be described and their significance for libraries discussed:

  • Promoting Innovation and Economic Growth: The Special Problem of Digital Intellectual Property (Committee for Economic Development, 2004)
  • Fair Use in the U.S. Economy: Economic Contribution of Industries Relying on Fair Use (Computer and Communications Industry Association, 2007)

The advocacy and educational roles and responsibilities of librarians on copyright also will be outlined.

http://connect.educause.edu/library/abstract/PromotingInnovationa/35540
http://www.ccianet.org/artmanager/publish/news/First-Ever_Economic_Study_Calculates_Dollar_Value_of.shtml

Handout (PowerPoint)

 

Update on Scholarly Communication Efforts by Microsoft Corporation

Lee Dirks
Director, Scholarly Communications
Microsoft Corporation

Over the past 18-24 months, Microsoft Corporation has become far more involved in the academic and scholarly communication space, and is actively engaging in projects with libraries, archives, scientists and scholarly publishers worldwide. This session will present an overview of these efforts, explain Microsoft’s strategy in this area, and illustrate some future efforts and developments currently in progress. Attendees are encouraged to provide real-time feedback on the presentation and should look at this as an opportunity to help influence the future direction of Microsoft’s efforts.

http://www.microsoft.com/science

PowerPoint Presentation

 

The Report of the MLA Task Force on Evaluating Scholarship for Tenure and Promotion: One Year On

David E. Laurence
Director of Research and ADE
Modern Language Association
David Nicholls
Director of Book Publications
Modern Language Association

In December 2006, the Modern Language Association (MLA) released the report of its Task Force on Evaluating Scholarship for Tenure and Promotion. The Task Force reported the results of a spring 2005 online survey of 1,339 departments in 734 institutions across the United States covering a range of doctorate, master’s, and baccalaureate institutions. On the basis of the survey, a review of relevant reports, studies, and documents, and extensive consultation with academic administrators, MLA members, and publishers, the committee set forth 20 recommendations. In this session, the response the report has received from a variety of scholarly and professional organizations will be presented and discussed, and input will be solicited on how the report does or does not address the particular concerns of session attendees. Meeting attendees are encouraged to review the report (including an executive summary) on the MLA Web site prior to the session.

http://www.mla.org/tenure_promotion