Hall-Hoag at Scale: Taming 800,000 Pages
Birkin Diana
Lead Developer, Digital Technologies, Library
Brown University
Justin Uhr
Library Web and Applications Developer, Digital Technologies, Library
Brown University
In 2023, the John Hay Library, Brown University’s special collections library, received a grant to scan some 800,000 pages of materials from roughly 35,000 organizations represented in “The Hall-Hoag Collection of Dissenting and Extremist Printed Propaganda,” and to ingest them into Brown’s Digital Repository. Because of the volume and timeline, the individual items ingested were linked to their organizations. The briefing will share (1) current “phase-1: ingestion” work involving optical character recognition research, rotation detection, organizational-metadata production, and a flexible ingestion pipeline; and (2) preliminary “phase-2: enhancement” investigations into using large language models for summarization to improve item-level MODS, using multimodal embedding models to group individual items into multi-page documents, and finding ways to make the collection discoverable without its results dominating all repository searches.
- Hall-Hoag collection brief overview: https://library.brown.edu/collatoz/info.php?id=62
- Brown Digital Repository Hall-Hoag collection page: https://repository.library.brown.edu/studio/collections/bdr:wum3gm43/
- Hall-Hoag collection finding-aid-database website: https://apps.library.brown.edu/hall-hoag/
Processing Library Collections at Scale for Broad Research and Artificial Intelligence Training
Catherine Brobston
Program Director, Institutional Data Initiative, Law Library
Harvard University
Greg Leppert
Executive Director, Institutional Data Initiative, Law Library
Harvard University
The Institutional Data Initiative at the Harvard Law School Library is working to build library capacity while improving the training data available for artificial intelligence (AI). In June, the Library released Institutional Books, a dataset of nearly one million public domain volumes from Harvard Library, scanned through its partnership with Google. Next year, the Library will release a dataset of newspapers in partnership with the Boston Public Library. In processing library datasets, the goal is to release structured and refined collections that improve usability for humans and machines alike. The briefing will include insights on working with library data at scale, applying AI tools to build reliable pipelines, learning from a community of academic and AI users, and working within institutional systems to create room for this work.
https://www.institutional.org/