Meeting with Peter Chan, guest researcher at Enssib

25/10/2024

1) You are a Web Archivist at Stanford University Libraries. What are your responsibilities at the library?
As a web archivist, I collaborate with Stanford's bibliographers, who select websites for preservation. Current web archiving technology can capture only certain parts of websites, which can leave a gap between what bibliographers expect and what is actually preserved. Additionally, we rely on an external vendor to crawl the content and transfer it to our preservation repository. I coordinate issue resolution as problems arise during this process, and if the vendor cannot resolve an issue, I try to fix it with other available tools.

I also use Generative AI to summarize the content of each website as metadata for our catalog system. This ensures that researchers can discover archived websites alongside other library resources, such as books, databases, and journals. Currently, some platforms for archived websites offer full-text search to explore content, and I am exploring the use of WARC-GPT, an open-source tool that enables researchers to conduct semantic searches within web archives.

2) Between 2013 and 2019, you led the ePADD project on email archiving. What are the main challenges posed by email archiving?
As I mentioned in my talk to students at Enssib, "Exploring the Interdisciplinary Connections of Email Archiving: From Archival Studies to AI," email archiving requires knowledge from at least eight disciplines: Archival Studies, Digital Archaeology, Web Archiving, Data Privacy, Digital Preservation, Narrow AI, Social Network Analysis, and Generative AI. It's difficult for one person to be an expert in all eight disciplines, but it's crucial to recognize the complexity and collaborate with people who possess the necessary skills.

Stanford Named Entity Recognizer (NER), first released on September 18, 2006, is an AI tool for recognizing named entities in English text, particularly in the categories PERSON, ORGANIZATION, and LOCATION. It is very useful to archivists and researchers who need to extract named entities from documents. However, asking historians to download Stanford NER and format their data for the tool can be too much to expect.

The ePADD project integrates Stanford NER into a user-friendly package, making it easy for archivists and researchers to use. Implementing this integration requires close collaboration between archivists, AI experts, and programmers. This complexity is one reason why few packages like ePADD exist to help archivists and researchers fully leverage the power of AI.

3) You are interested in AI applications within libraries and archives. Can you tell us more?
"More Product, Less Process" (MPLP) is a 2005 article by Mark A. Greene and Dennis Meissner advocating for minimal archival processing to reduce backlogs and speed up access to collections. This approach has been widely adopted, increasing accessibility without limiting future detailed processing. However, even with MPLP, archivists still face significant backlogs.

With AI tools available for tasks like summarization, facial recognition, and topic modeling, I propose a new approach: "More AI, More Product, More Feedback" for archival processing. Since most AI tools are trained on general texts, images, or videos, we must evaluate them carefully before implementation. Additionally, these tools may not perform well for long-tail cases. Researchers can play a crucial role in identifying issues, which is why we need a robust feedback system to address their concerns.

With the introduction of generative AI built on foundation models (such as large language models) and related technologies, we can now perform semantic search, which matches on meaning rather than exact wording and so returns results that traditional full-text search would miss. However, some AI tools are currently disconnected from our collections, creating challenges for researchers. We need to integrate AI tools more seamlessly into our collections to make them as transparent and user-friendly as possible.

Peter Chan is an archivist at Stanford University Libraries who specializes in digital archiving and formerly led the ePADD email archiving project. He is currently working on applications of generative AI to library collections.