CNI: Coalition for Networked Information


From Scan to Discovery: Responsible AI and Open Source Strategies for Document and AV Access


March 25, 2026

Expanding Access to Historic Scanned Documents Using R’s Tesseract Package

Adelynn Shirts
Open Science and Publishing Graduate Assistant
Utah State University

David Advent
Scholarly Communication Librarian
Utah State University 


Utah State University’s institutional repository, DigitalCommons@USU, hosts over 100,000 PDF documents, many of which were printed before 1975 and scanned later. These scans lack embedded text layers, which makes them inaccessible to screen readers without additional processing. A scalable pipeline was built to identify documents lacking embedded text and to run optical character recognition (OCR) on them so that screen readers can read the content. Two preprocessing functions, one light and one heavy, deskew, denoise, and enhance document clarity before OCR is performed. Dictionary coverage was compared between the two: light preprocessing was computationally faster but produced lower dictionary coverage, while heavy preprocessing added a modest amount of time and increased dictionary coverage slightly. An evaluation of the outputs showed that the dictionary coverage of documents lacking embedded text layers was similar to that of documents containing them. While this does not make the documents fully compliant with Americans with Disabilities Act standards, it is an important first step toward accessibility for older publications, especially given the open source nature of the code and process.

https://github.com/ashirts/Expanding-Access-to-Historic-Scanned-Documents
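The two checks described in the abstract above, flagging documents without a text layer and scoring OCR output by dictionary coverage, can be illustrated with a minimal Python sketch. It is not the authors' code, which uses R's tesseract package. The function names, the `min_chars` threshold, and the definition of dictionary coverage as the share of alphabetic tokens found in a word list are all assumptions made for illustration.

```python
import re


def needs_ocr(page_texts, min_chars=20):
    """Flag a document as lacking a text layer.

    `page_texts` is a list of strings already extracted from each page.
    `min_chars` is a hypothetical threshold, not a value from the project.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars


def dictionary_coverage(text, dictionary):
    """Return the fraction of alphabetic tokens in `text` found in `dictionary`.

    This definition of coverage is an assumption; the abstract does not
    spell out how the metric is computed.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)


# Toy word list and two OCR outputs of the same line, one clean and one noisy.
WORDS = {"the", "river", "flows", "north"}
clean_output = "The river flows north"
noisy_output = "Tbe rivcr flows north"

clean_score = dictionary_coverage(clean_output, WORDS)
noisy_score = dictionary_coverage(noisy_output, WORDS)
```

Comparing such scores for the same page under light and heavy preprocessing gives a rough, reference-free signal of OCR quality, since no hand-corrected transcript is needed.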

 

Improving Accessibility and Discoverability Utilizing Open Source Models in a Novel Modular Design

Brian McBride
Associate Director of Digital Infrastructure Development
University of Utah

Harish Maringanti
Associate Dean for Research
University of Utah

Bohan Zhu
Web Software Developer
University of Utah


University libraries are under growing pressure to expand access and improve discovery while meeting new accessibility expectations for digital collections. At the University of Utah’s J. Willard Marriott Library, the digital infrastructure development team is building a flexible, modular workflow to help bring large audio-visual (AV) collections into alignment with Department of Justice accessibility requirements at scale. The platform orchestrates open source speech-to-text and language models to generate time-aligned transcripts and captions, structured segmentation, word clouds, entity recognition, and descriptive metadata that improve both compliance and discoverability. The session will highlight the implementation approach, early results, and lessons learned, including human review checkpoints, staff support and buy-in, provenance and auditability, and how workflows are being designed to adapt as models and standards evolve. The session will also cover practical strategies other institutions can reuse to accelerate accessible AV delivery without locking into a single vendor or toolchain, as well as the team’s future development plans for supporting other formats, including images and PDFs.
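As an illustration of the caption step described above, the following Python sketch turns time-aligned transcript segments into a WebVTT caption file. The segment layout (dictionaries with `start`, `end`, and `text` keys, times in seconds) is an assumed shape for speech-to-text output, not the format used by the University of Utah platform, and the function names are hypothetical.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_webvtt(segments):
    """Build WebVTT text from a list of time-aligned segments.

    Each segment is assumed to be a dict with 'start', 'end' (seconds)
    and 'text'. Cues are numbered in order of appearance.
    """
    lines = ["WEBVTT", ""]
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(str(index))
        lines.append(f"{start} --> {end}")
        lines.append(segment["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


# Toy segments standing in for speech-to-text output.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the archive."},
    {"start": 2.5, "end": 5.0, "text": "This recording dates from 1968."},
]
vtt_text = segments_to_webvtt(example_segments)
```

Keeping caption generation as a small, separate step from transcription is one way to let the speech-to-text model be swapped out without touching the output format.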

Strategies for Responsible AI in Manuscript Transcription (Lightning Talk)
Sara Brumfield
Partner
FromThePage 

FromThePage is a crowdsourcing platform for archives and libraries where volunteers transcribe, index, and describe historic documents. This talk will give an overview of how the platform and its community are making decisions that keep the use of artificial intelligence (AI) in historical document transcription transparent, optional, and tentative, including topics such as:

  • Optional usage of AI by transcribers and institutions
  • Surfacing and logging use of AI drafts
  • Provenance in exports showing both AI and human contributions
  • Detecting unauthorized use of AI
  • Measuring accuracy

    http://www.fromthepage.com
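For the accuracy topic in the list above, one common way to score a transcription draft against a human-verified reference is character error rate (CER), which is the edit distance divided by the reference length. The talk description does not say which metric FromThePage uses, so the Python sketch below is an assumption for illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER: edit distance divided by the length of the reference text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example: a human-verified line and a draft with one wrong character.
verified_line = "abcd"
draft_line = "abcf"
cer = character_error_rate(verified_line, draft_line)
```

A lower CER means the draft needed fewer character-level corrections. The same function can score any draft, human or AI, against the verified text.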


Filed Under: Access & Equity, Artificial Intelligence, CNI Spring 2026 Project Briefing, Cyberinfrastructure, Digital Humanities, Digital Libraries, Digital Preservation, Electronic Theses & Dissertations (ETDs), Emerging Technologies, Information Access & Retrieval, Project Briefing Pages, Repositories, Scholarly Communication, Spaces
Tagged With: cni2026spr, Project Briefings & Plenary Sessions

Last updated:  Wednesday, March 25th, 2026

 
