Ask HN: What tools are you using for AI evals? Everything feels half-baked

Jun 05, 2025 - news.ycombinator.com
The post describes the challenges a team faces running LLMs in production for content generation, customer support, and code-review assistance, particularly in building an effective evaluation pipeline. They have tried several tools, including OpenAI's Evals framework, LangSmith, Weights & Biases, Humanloop, and Braintrust, but each has significant limitations. OpenAI's Evals works well for benchmarking but not for custom use cases or real-time monitoring. LangSmith has strong tracing, but its evaluation features feel secondary and it can be costly. Weights & Biases is powerful but complex and not friendly to non-ML experts. Humanloop offers a clean interface but limited evaluation types and high pricing. Braintrust is promising but lacks documentation and integration options.

The team wants a solution that offers real-time evaluation monitoring, custom evaluation functions, human-in-the-loop workflows, cost tracking, and integration with their existing observability stack, while remaining accessible to their product team. Currently they rely on custom scripts and monitoring dashboards, supplemented by manual reviews, an approach that neither scales nor catches edge cases. They are asking for recommendations for tools that handle production LLM evaluation well, particularly from teams without dedicated ML engineers, and whether their expectations are too high or the available tooling is simply immature.
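
The kind of "custom script" evaluation described above often amounts to a small rubric function plus a manual-review queue. The following is a minimal, hypothetical sketch of that pattern; the checks, thresholds, and function names are illustrative assumptions, not taken from the post or from any specific tool.

```python
# Hypothetical sketch of a hand-rolled eval: rule-based checks on a generated
# reply, with low scorers flagged for human review. Checks and thresholds are
# placeholders for whatever rubric a team actually cares about.
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float               # 0.0-1.0 rubric score
    needs_human_review: bool   # route to the manual-review queue
    reasons: list[str]

def evaluate_reply(prompt: str, reply: str) -> EvalResult:
    reasons = []
    score = 1.0

    # Each check deducts from the score and records why.
    if len(reply.strip()) < 20:
        score -= 0.5
        reasons.append("reply too short")
    if "as an ai language model" in reply.lower():
        score -= 0.3
        reasons.append("boilerplate disclaimer")
    if not any(word in reply.lower() for word in prompt.lower().split()[:10]):
        score -= 0.2
        reasons.append("reply may ignore the question")

    score = max(score, 0.0)
    # Anything below the threshold is flagged for human review.
    return EvalResult(score=score, needs_human_review=score < 0.7, reasons=reasons)

if __name__ == "__main__":
    print(evaluate_reply("How do I reset my password?", "Please contact support."))
```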

Key takeaways:

  • Current evaluation tools for LLMs have significant limitations, particularly in custom use cases and real-time monitoring.
  • OpenAI's Evals framework, LangSmith, Weights & Biases, Humanloop, and Braintrust each have specific drawbacks, such as complexity, cost, and limited functionality.
  • The team needs a solution that supports real-time evaluation, custom functions, human-in-the-loop workflows, cost tracking (a sketch of a simple cost-tracking wrapper follows this list), and integration with existing systems.
  • The current workaround involves custom scripts and manual reviews, which are not scalable and miss edge cases, indicating a gap in mature tooling for LLM evaluation.
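
For the cost-tracking and observability-integration points above, a common stopgap is to wrap each model call and emit a structured record with token counts and an estimated cost. The sketch below is an assumption about what such a wrapper might look like: the price table is hand-maintained (the rates shown are placeholders, not authoritative vendor pricing), and print() stands in for whatever log shipper feeds an existing observability stack.

```python
# Illustrative per-call cost tracking. Rates are placeholder $/1K-token values;
# keep a real table in sync with your provider's published pricing.
import json
import time

PRICE_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def record_llm_call(model: str, input_tokens: int, output_tokens: int,
                    eval_score: float | None = None) -> dict:
    rates = PRICE_PER_1K[model]
    cost = (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]
    record = {
        "ts": time.time(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
        "eval_score": eval_score,
    }
    # Swap print() for the log pipeline that feeds your observability stack.
    print(json.dumps(record))
    return record

record_llm_call("gpt-4o-mini", input_tokens=850, output_tokens=240, eval_score=0.8)
```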