David S.H. Rosenthal
Chief Scientist, LOCKSS Program
Director, LOCKSS Program
A single replica of a large database, such as the petabyte-scale protein database, may cost millions of dollars, so minimizing the number of replicas needed to assure adequate preservation becomes the dominant design goal. As preserving large scientific datasets becomes a focus of the National Science Foundation’s cyber-infrastructure program, how well prepared are we to make rational investment decisions about systems in this area?
Drawing on research by the LOCKSS research team and others, this session surveys the state of engineering knowledge and points out the gaps that digital preservation research needs to fill. These gaps include the need for better specification and characterization of media performance (recent papers show that everything you know about disk reliability is wrong), better models of the threats (the most frequent cause of data loss at large sites is operator error), better models of fault tolerance (recent papers show that both RAID and Byzantine Fault Tolerance are inappropriate models), and better ways of formulating the relationship between a preservation service and its customers (disclaiming all responsibility for the preserved data is not a suitable service level agreement).
PowerPoint Presentation (PDF)