Zack E. Murrell
Director of the Southeast Regional Network of Expertise and Collections
Appalachian State University
Coordinator, SunSITE, Customer Technology Support
University of Tennessee
Robert K. Peet
Professor and Chair
University of North Carolina, Chapel Hill
Professor of Information
University of Tennessee
Assistant Director for Digital Libraries
Indra Neil Sarkar
Informatics Manager, MBLWHOI Library
Semantics Manager, EOL Biodiversity Informatics
Marine Biological Laboratory
Building a Distributed Digital Repository of Biological Data: Special Challenges (Murrell et al.)
Owing to the richness of the Earth’s biodiversity, the life sciences have historically focused on innovation and discovery rather than standardization. The emerging semantic Web and concomitant social networking applications now enable biodiversity scientists to develop and employ standards that allow unambiguous communication about types of organisms, their attributes, and their distributions. However, the enormous amount of non-standardized legacy data has become an impediment at a time when the life sciences are gathering and analyzing data at regional and continental scales to address critical issues such as biodiversity conservation, habitat fragmentation, and global warming.
The process of identifying types of organisms, whether in the wild or as species and museum specimens, is at the crux of the development of standards. The scientific community needs to reach a consensus on distributing information in a standards-compliant format. In addition to these specific challenges, this community faces the same challenges other digital initiatives encounter: the conversion of legacy data, digitization of a wide variety of media, the interweaving of resources, long-term storage and preservation of large datasets, and shared management of distributed assets. The Southeast Regional Network of Expertise and Collections (SERNEC) is a National Science Foundation (NSF)-funded Research Coordination Network (RCN) that organizes a community to facilitate achievement of these goals at a realizable yet powerful scale. This regional consortium is a “virtual community” of herbarium curators that can be a model for other life science networks developing around the world. This grassroots network provides an electronic federated database that seeks to disseminate all organism occurrence and attribute data in a standards-compliant format and to provide critical tools for integrating large datasets in which organisms were identified using divergent taxonomic standards. This goal will be reached by working in concert with the larger global arena, led by the international efforts of the Taxonomic Databases Working Group (TDWG), also known as Biodiversity Information Standards.
The SERNEC virtual community is supported by the image and digital object repository of the NSF-funded Morphbank project. The Morphbank repository provides tools that allow users to collect and annotate objects in order to illustrate the complex relationships between them. Morphbank facilitates interactions among the virtual community members by providing a shared environment for capturing discussions about the underlying biological issues. As the SERNEC database is built, it will be reviewed by the collective taxonomic expertise of this virtual community, resulting in an increasingly accurate portrayal of the biogeography of the region.
The SERNEC virtual community is able to leverage the expertise of information scientists, social scientists, educators, and artists, as well as the region’s curators, to ensure that data are reliable and authoritative. As collaborations within this virtual community grow, and as innovations such as interactive keys and mapping of biotic and abiotic components of the landscape develop, we will provide complex information in an intuitively understandable fashion to various user groups, thereby stimulating interest in plant systematics and biogeography. On a larger scale, this cooperative network will be capable of addressing the critical worldwide issues of habitat destruction, species loss, and global climate change.
Handout (MS Word)
The Biodiversity Heritage Library: A Knowledge Domain Enterprise (Garnett & Sarkar)
Ten major natural history museum libraries, botanical libraries, and research institutions are joined in a collaborative effort to digitize legacy biodiversity literature in an open access manner. From this partnership grew the Biodiversity Heritage Library (BHL) project. The partners envision that any research scientist or student who has access to the Internet, located anywhere in the world, will be able to search for specific information in all of the literature relevant to biodiversity and transparently link into relevant taxonomic, geographic, or other useful databases. Such a tool would erase much of the expensive, labor-intensive work of library research and speed the production of research results many times over.
Why digitize this literature? The ten partner libraries collectively hold a substantial part of the world’s published knowledge on biological diversity. Yet this wealth of knowledge is available only to those few who can gain direct access to these collections. This body of biodiversity knowledge is thus effectively sequestered from wide use for a broad range of applications, including research, education, taxonomic study, biodiversity conservation, protected area management, disease control, and maintenance of diverse ecosystem services. Much of this published literature is rare or has limited global distribution and is available in only a few select libraries. From a scholarly perspective, these collections are of exceptional value because the domain of systematic biology depends, more than any other science, upon historic literature. To positively identify a rare specimen, a working biologist may have to consult a 100-year-old text because that was the last time the organism was found, recorded, and described.
Building on existing tools and services developed at the Universal Biological Indexer and Organizer (uBio) to index organism names (NameBank, which contains over 10 million name strings) and their associated hierarchies (ClassificationBank, which contains over 80 classifications), “taxonomic intelligence” will be integrated into the documents as soon as they are digitized, using an established named entity recognition tool, TaxonFinder. The integration of taxonomic intelligence, via links from the name strings located within each generated text file, will enable linkages to other relevant indexed content and other web-accessible name-based sources. The types of organisms associated with each digitized document will be characterized using the taxonomic groupings reflected in ClassificationBank. This will include the generation of descriptive statistics pertaining to organisms relative to other comparative axes (e.g., temporal or geographic).
Ultimately, a complete list of the name strings as they appear in each digitized document, the reconciled contemporary form of each name string, and any other relevant metadata will be incorporated into the index files and used to facilitate knowledge integration across a range of relevant biological databases.
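The indexing idea described above can be illustrated with a minimal Python sketch. It is not the BHL, uBio, or TaxonFinder implementation: the `RECONCILED` table is a hypothetical stand-in for a NameBank-style lookup, the regular expression is a naive substitute for real named entity recognition, and the function names and sample documents are invented for illustration. The sketch finds name strings as they appear in each document, maps each to a contemporary form, and builds an index from the reconciled name to the documents that mention it.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a NameBank-style lookup (illustrative entries only):
# verbatim name string -> reconciled contemporary name.
RECONCILED = {
    "Acer saccharum": "Acer saccharum",
    "Acer saccharophorum": "Acer saccharum",
    "Rhus radicans": "Toxicodendron radicans",
}

# Naive pattern for a Latin binomial: a capitalized genus followed by a
# lowercase epithet. A real recognizer is far more sophisticated than this.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_name_strings(text):
    """Return known name strings in order of first appearance in the text."""
    found = []
    for match in BINOMIAL.finditer(text):
        candidate = f"{match.group(1)} {match.group(2)}"
        # Keep only candidates the lookup table recognizes, without repeats.
        if candidate in RECONCILED and candidate not in found:
            found.append(candidate)
    return found


def build_index(documents):
    """Map each reconciled name to the sorted ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for verbatim in find_name_strings(text):
            index[RECONCILED[verbatim]].add(doc_id)
    return {name: sorted(ids) for name, ids in index.items()}


# Invented sample texts standing in for digitized volumes.
documents = {
    "vol1": "Michaux recorded Acer saccharophorum along the ridge.",
    "vol2": "Specimens of Acer saccharum and Rhus radicans were collected.",
}
name_index = build_index(documents)
```

Because the index is keyed on the reconciled name, a search for a contemporary name also retrieves documents that use an older synonym, which is the kind of linkage the abstract describes.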