
Building and Evaluating Evals for Retrieval - Parea AI

Mar 05, 2024 - news.bensbites.co
The blog post walks through using Parea AI's pre-built evaluation to measure, without reference labels, the hit rate of a retrieval setup built on Lantern, a Postgres vector database and toolkit. The experiment uses the Asclepius Clinical Notes dataset, which contains synthetic physician summaries from clinical settings. The retrieval system embeds the notes with BAAI's bge-base-en-v1.5 and OpenAI's embedding models, and its hit rate is assessed with a reference-free, LLM-based eval metric.
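
The summary doesn't reproduce the setup code, but a minimal sketch of such a retrieval step might look like the following. The `notes` table, `note_text`/`embedding` columns, and connection string are hypothetical, and the `<->` array-distance operator is taken from Lantern's documented syntax; verify the operator and index options against your installed Lantern version.

```python
import psycopg2
from sentence_transformers import SentenceTransformer

# BAAI's bge-base-en-v1.5, one of the two embedding models the post mentions
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

def top_k_notes(query: str, k: int = 5) -> list[str]:
    # bge models recommend normalized embeddings for cosine-style search
    vec = model.encode(query, normalize_embeddings=True).tolist()
    with psycopg2.connect("dbname=asclepius") as conn:  # hypothetical DB name
        with conn.cursor() as cur:
            # Lantern indexes REAL[] columns; `<->` computes the distance
            # configured on the index (e.g. L2-squared or cosine)
            cur.execute(
                "SELECT note_text FROM notes "
                "ORDER BY embedding <-> %s::real[] LIMIT %s",
                (vec, k),
            )
            return [row[0] for row in cur.fetchall()]
```

A "hit" is then simply whether the chunk that answers the query appears among these top-k results.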

The results show that the pre-built evaluation metric aligns well with the measured hit rate, achieving 83% on the Q&A subset and 53% on the Paraphrasing subset. However, adding few-shot examples doesn't necessarily improve the metric and can even hurt it on the Q&A subset. Applying chain-of-thought in JSON mode, by requiring a `thoughts` field in the response, had a positive effect on the Q&A subset and improved the Paraphrasing subset significantly. The post concludes that Parea + Lantern is a fast and easy way to get started with AI applications.
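
As a concrete illustration of that technique (a sketch, not Parea's actual prompt or model choice), a reference-free judge in JSON mode can force chain-of-thought by requiring a `thoughts` field before the verdict:

```python
import json
from openai import OpenAI

client = OpenAI()

def context_answers_query(query: str, context: str) -> bool:
    # JSON mode guarantees parseable output; putting `thoughts` first makes
    # the model reason before it commits to a verdict (chain-of-thought)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; the post doesn't fix a model here
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Does the context contain the information needed to answer "
                f"the query?\nQuery: {query}\nContext: {context}\n"
                'Respond as JSON: {"thoughts": "<reasoning>", "hit": true|false}'
            ),
        }],
    )
    return bool(json.loads(resp.choices[0].message.content)["hit"])
```

Averaging this boolean over all queries yields the reference-free hit-rate estimate.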

Key takeaways:

  • The article discusses the use of Parea AI's pre-built eval for reference-free measurement of the hit rate of a retrieval setup, using Lantern, a Postgres vector database and toolkit.
  • The Asclepius Clinical Notes dataset of synthetic physician summaries from clinical settings is used for the experiment, with the aim of assessing the embedding models' performance on different subsets.
  • The article also explores reference-free evaluation of a retrieval system's hit rate using an LLM-based eval metric, and whether few-shot examples improve the accuracy of that metric.
  • Results showed that the pre-built evaluation metric approximates hit rate well without requiring labeled data (see the sketch after this list), and that prompt engineering techniques such as few-shot examples and chain-of-thought in JSON mode can help when the metric underperforms on a particular subset.
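
For comparison, the ground-truth hit rate that the metric approximates is simple to compute when labels exist: the fraction of queries whose known source chunk appears in the top-k retrieved results. A minimal sketch (the pairing of gold IDs with retrieved IDs is assumed):

```python
def hit_rate(results: list[tuple[str, list[str]]]) -> float:
    # results: one (gold_chunk_id, retrieved_chunk_ids) pair per query
    hits = sum(1 for gold, retrieved in results if gold in retrieved)
    return hits / len(results)
```

The reference-free eval matters precisely because, in production, the gold chunk IDs on the left of each pair are unavailable.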
