
WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI | Library Innovation Lab

Jun 08, 2024 - lil.law.harvard.edu
The Library Innovation Lab has released WARC-GPT, an open-source tool that uses AI to explore web archives. It lets users create custom chatbots that use a set of web archive (WARC) files as their knowledge base, so a collection can be explored through conversation. Users ask questions in natural language against a collection of WARC files, and the tool offers a new starting point for search by combining multi-document full-text search with summarization. It also lists the sources used to generate each response, along with relevant text excerpts, which can be used to verify the information provided and to identify points of interest within a collection.

WARC-GPT is a Retrieval Augmented Generation (RAG) pipeline: it builds a knowledge base out of a set of documents, in this case WARC files, and uses that knowledge base to help answer questions posed to a Large Language Model (LLM) of the user’s choosing. The tool was designed with both high customizability and transportability in mind, with settings, models, and prompts meant to be interchanged and experimented with. It can be run locally against open-source models, and it can also interact with closed-source LLM APIs such as OpenAI or Anthropic if API keys are provided in its configuration file.
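
To make the RAG idea concrete, here is a minimal sketch in Python of the retrieve-then-prompt pattern described above. It is not WARC-GPT's code: the `DOCUMENTS` list, the example URLs, and the function names are hypothetical, and a simple bag-of-words cosine similarity stands in for the embedding model and vector store that a real pipeline would likely use. The final call to an LLM is left out on purpose.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for text extracted from WARC records: (source URL, text).
DOCUMENTS = [
    ("https://example.org/chandrayaan-3",
     "Chandrayaan-3 landed near the lunar south pole in August 2023."),
    ("https://example.org/luna-25",
     "Luna 25 crashed into the Moon after an orbital manoeuvre failed."),
    ("https://example.org/recipes",
     "A simple recipe for lentil soup with cumin and garlic."),
]


def vectorize(text):
    """Turn text into a bag-of-words vector (assumption: replaces real embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=2):
    """Return up to top_k (score, url, text) tuples that share terms with the question."""
    query = vectorize(question)
    scored = [(cosine(query, vectorize(text)), url, text) for url, text in documents]
    scored = [hit for hit in scored if hit[0] > 0]
    scored.sort(key=lambda hit: hit[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, hits):
    """Assemble an LLM prompt that carries the retrieved excerpts and their sources."""
    context = "\n\n".join(
        f"[{i}] Source: {url}\n{text}" for i, (_, url, text) in enumerate(hits, 1)
    )
    return (
        "Answer the question using only the excerpts below, and cite them by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


hits = retrieve("Where did Chandrayaan-3 land?", DOCUMENTS)
prompt = build_prompt("Where did Chandrayaan-3 land?", hits)
sources = [url for _, url, _ in hits]  # can be shown next to the model's answer
```

Keeping `sources` alongside the prompt is what makes it possible to list the pages behind an answer, so a reader can check the response against the archived text.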

Key takeaways:

  • WARC-GPT is an open-source tool that uses Retrieval Augmented Generation to create custom chatbots that use web archive files as their knowledge base, allowing users to explore collections through conversation.
  • It provides a new starting point for search using multi-document full-text search with summarization to explore the contents of web archives, listing the sources used to generate the response and relevant text excerpts.
  • WARC-GPT can be used with a variety of Large Language Models (LLMs), giving archivists and researchers a chatbot that has knowledge of their collections. This is especially helpful for private collections of WARCs or for those that were not part of an LLM’s training data.
  • The tool was tested with a small thematic collection related to the lunar landing missions of India and Russia in 2023. In most cases WARC-GPT provided compelling answers and made appropriate use of the provided sources, most of which were relevant.