Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps

Jeffrey and Kritin are the creators of Confident AI, a cloud platform for evaluating and unit-testing LLM applications, built around their open-source package, DeepEval. DeepEval already runs in the CI/CD pipelines of companies like BCG and AstraZeneca, performing over 600K evaluations daily. Because running evaluations is only part of the workflow, Confident AI adds a dataset editor, a regression catcher, and iteration insights. These tools help users inspect failing test cases, identify regressions, and select the best model and prompt combination. The platform supports RAG pipelines, agents, and chatbots, so companies can switch LLMs, rewrite prompts, and keep test sets synchronized with their codebase.
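To make the idea of catching regressions concrete, here is a minimal sketch in plain Python. It is not Confident AI's or DeepEval's implementation; the function name `find_regressions` and the test-case ids are hypothetical, and each run is assumed to be a simple mapping from test-case id to pass/fail.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return ids of test cases that passed in `baseline` but fail in `candidate`.

    Hypothetical helper for illustration only: a regression is a case that
    used to pass and now explicitly fails. Cases missing from the candidate
    run are ignored rather than counted as failures.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and candidate.get(case_id) is False
    )


# Two hypothetical runs of the same test set, before and after a prompt change.
baseline_run = {"refund-policy": True, "greeting": True, "edge-case": False}
candidate_run = {"refund-policy": False, "greeting": True, "edge-case": False}

regressions = find_regressions(baseline_run, candidate_run)
print(regressions)  # ['refund-policy']
```

Only `refund-policy` is reported: `edge-case` was already failing before the change, so it is not a regression.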

Despite its capabilities, DeepEval's primary evaluation method, LLM-as-a-judge, faces consistency challenges. To address this, Confident AI introduced a DAG metric, a decision-tree-based approach that provides deterministic results by breaking test cases into atomic units. This metric is particularly effective in scenarios with clearly defined success criteria, like text summarization. Although still in its early stages, the DAG metric aims to offer reliable, code-driven, open-source metrics for LLM benchmarking. Confident AI is available on a freemium tier, with a temporary waiver on the requirement for a work email signup.
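The decision-tree idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not DeepEval's actual DAG metric API: the `Check` and `Verdict` classes, the `evaluate` function, and the heading-based summarization criteria are all hypothetical, and each yes/no judgement is a plain string check standing in for a narrow question that a real system might pose to an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Verdict:
    """Leaf node: a fixed score assigned when this branch is reached."""
    score: float


@dataclass
class Check:
    """Internal node: one atomic yes/no question about the output."""
    question: str
    judge: Callable[[str], bool]  # stand-in for a narrow yes/no judgement
    if_yes: "Node"
    if_no: "Node"


Node = Union[Verdict, Check]


def evaluate(node: Node, output: str) -> float:
    """Walk the tree from the root until a Verdict leaf is reached."""
    while isinstance(node, Check):
        node = node.if_yes if node.judge(output) else node.if_no
    return node.score


def has_heading(name: str) -> Callable[[str], bool]:
    """Hypothetical criterion: the summary contains a '<name>:' heading."""
    return lambda text: f"{name}:" in text


# Hypothetical success criteria for a structured summary.
summary_metric = Check(
    question="Does the summary contain an 'Intro:' heading?",
    judge=has_heading("Intro"),
    if_yes=Check(
        question="Does it also contain a 'Conclusion:' heading?",
        judge=has_heading("Conclusion"),
        if_yes=Verdict(1.0),
        if_no=Verdict(0.5),
    ),
    if_no=Verdict(0.0),
)

full_score = evaluate(summary_metric, "Intro: scope\nConclusion: result")
partial_score = evaluate(summary_metric, "Intro: scope only")
zero_score = evaluate(summary_metric, "No headings here")
```

Because every score comes from a fixed path through the tree, the same sequence of yes/no answers always yields the same score; any remaining variability is confined to the individual atomic judgements.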

Key takeaways:

Confident AI is a cloud platform built around DeepEval, an open-source package for evaluating and unit-testing LLM applications.
The platform includes features like a dataset editor, regression catcher, and iteration insights to enhance LLM evaluation and benchmarking.
Confident AI aims to make benchmarking more reliable with a new DAG metric that gives deterministic results, addressing the consistency problems of LLM-as-a-judge evaluation.
The platform is available on a freemium tier, with a temporary option to sign up without a work email.
