Expanding Access to Historic Scanned Documents Using R’s Tesseract Package
Adelynn Shirts
Open Science and Publishing Graduate Assistant
Utah State University
David Advent
Scholarly Communication Librarian
Utah State University
Utah State University’s Institutional Repository, DigitalCommons@USU, hosts over 100,000 PDF documents, many of which were printed before 1975 and scanned later. These scans lack embedded text layers, which makes them inaccessible to screen readers without additional processing. A scalable pipeline was built to identify documents lacking embedded text and perform optical character recognition (OCR) on them, making the content accessible to screen readers. Two preprocessing functions deskew, denoise, and enhance document clarity before OCR is performed. Dictionary coverage was compared between the light and heavy preprocessing functions: light preprocessing was computationally faster but yielded lower dictionary coverage, while heavy preprocessing added a modest amount of time and increased dictionary coverage slightly. An evaluation of the outputs showed that dictionary coverage for documents lacking embedded text layers was similar to that for documents containing them. While this does not make the documents fully compliant with Americans with Disabilities Act standards, it is an important first step toward accessibility for older publications, especially given the open source nature of the code and process.
https://github.com/ashirts/Expanding-Access-to-Historic-Scanned-Documents
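The dictionary coverage metric mentioned in the abstract can be read as the share of word tokens in a document's text that appear in a reference word list. The following is a minimal sketch of that idea, written in Python rather than R and not taken from the linked repository. The function name `dictionary_coverage`, the letters-only tokenization rule, and the tiny word list are all assumptions made for illustration.

```python
import re


def dictionary_coverage(text, dictionary):
    """Return the fraction of alphabetic tokens in `text` found in `dictionary`.

    Hypothetical helper: tokens are runs of ASCII letters, compared in lowercase.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        # No recognizable words at all, e.g. a blank or failed OCR page.
        return 0.0
    known = sum(1 for token in tokens if token in dictionary)
    return known / len(tokens)


# Hypothetical reference word list; a real run would load a full dictionary.
WORDS = {"the", "annual", "report", "of", "utah", "state"}

clean = "The annual report of Utah State"
noisy = "Tbe annua1 report of Utah Stnte"  # simulated OCR errors

print(dictionary_coverage(clean, WORDS))
print(dictionary_coverage(noisy, WORDS))
```

Under these assumptions, a higher score suggests cleaner OCR output, which is how coverage could be used to compare a light and a heavy preprocessing path on the same scans.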
Improving Accessibility and Discoverability Utilizing Open Source Models in a Novel Modular Design
Brian McBride
Associate Director of Digital Infrastructure Development
University of Utah
Harish Maringanti
Associate Dean for Research
University of Utah
Bohan Zhu
Web Software Developer
University of Utah
University libraries are under growing pressure to expand access and improve discovery while meeting new accessibility expectations for digital collections. At the University of Utah’s J. Willard Marriott Library, the digital infrastructure development team is building a flexible, modular workflow to help bring large audio-visual (AV) collections into alignment with Department of Justice accessibility requirements at scale. The platform orchestrates open source speech-to-text and language models to generate time-aligned transcripts and captions, structured segmentation, word clouds, entity recognition, and descriptive metadata that improve both compliance and discoverability. The session will highlight the implementation approach, early results, and lessons learned, including human review checkpoints, staff support and buy-in, provenance and auditability, and the design of workflows that can adapt as models and standards evolve. It will also cover practical strategies that other institutions can reuse to accelerate accessible AV delivery without locking into a single vendor or toolchain, as well as the team’s plans to support additional formats, including images and PDFs.
Strategies for Responsible AI in Manuscript Transcription (Lightning Talk)
Sara Brumfield
Partner
FromThePage
FromThePage is a crowdsourcing platform for archives and libraries where volunteers transcribe, index, and describe historic documents. This talk will give an overview of how the platform and its community are making decisions that keep the use of artificial intelligence (AI) in historical document transcription transparent, optional, and tentative, including topics such as:
- Optional usage of AI by transcribers and institutions
- Surfacing and logging use of AI drafts
- Provenance in exports showing both AI and human contributions
- Detecting unauthorized use of AI
- Measuring accuracy
http://www.fromthepage.com
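The abstract does not say how accuracy is measured. One common approach for transcription is character error rate (CER): the edit distance between a trusted reference transcript and a draft, divided by the length of the reference. The sketch below illustrates that general technique only and is not FromThePage's method; the function names and sample strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the reference prefix processed so
    # far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a reference character
                    current[j - 1] + 1,      # insert a hypothesis character
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a human-verified line versus a draft with one wrong letter.
print(character_error_rate("Dear Sir", "Dear Sie"))
```

A measure like this could be computed per page for AI drafts and for human transcriptions against the same reference, so that the two kinds of contribution are compared on equal terms.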