Corey Davis
Digital Preservation Librarian
University of Victoria
Large language models (LLMs) are reshaping how research libraries manage digital preservation and provide access to web archives. This presentation explores the potential and challenges of integrating LLMs with Retrieval-Augmented Generation (RAG) to improve the searchability and usability of Web ARChive (WARC) files. The session will showcase a RAG pipeline developed at the University of Victoria Libraries for processing and exploring web archives conversationally. It will also discuss the challenges of managing AI infrastructure locally, along with data quality issues, embedding strategies, and computational requirements. Attendees will gain insight into the practical applications of AI for access to unique digital collections, the technical and ethical considerations involved, and strategies for optimizing AI-driven discovery tools in library contexts. The presentation will argue that while AI enhances access to web archives and to digital collections more broadly, its successful deployment requires careful design, iterative refinement, and human oversight.
Full background information is available in: Davis, C. (2025). Unlocking web archives: LLMs, RAG, and the future of digital preservation. University of Victoria Libraries. https://hdl.handle.net/1828/21379