The team needs a solution that offers real-time evaluation monitoring, custom evaluation functions, human-in-the-loop workflows, cost tracking, and integration with their existing observability stack, while remaining accessible to their product team. Currently they rely on custom scripts and monitoring dashboards, supplemented by manual reviews, an approach that does not scale and misses edge cases. They are asking for recommendations, particularly from teams without dedicated ML engineers, for tools that handle production LLM evaluation well, and whether their expectations are too high or the available tooling is simply immature.
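
To make the requirements concrete, here is a minimal sketch of the kind of custom evaluation function the team describes: a per-sample scorer combined with cost tracking and a hook for routing low-scoring responses to human review. It is not tied to any of the tools discussed; all names, token prices, and the example disclaimer metric are hypothetical illustrations.

```python
# Hypothetical sketch of a custom LLM evaluation step with cost tracking
# and a human-in-the-loop flag. Names and prices are illustrative only.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    sample_id: str
    score: float              # 0.0-1.0, as judged by the custom metric
    cost_usd: float           # estimated cost of the LLM call being evaluated
    flagged_for_review: bool  # routes low-scoring cases to a human queue


def evaluate_sample(sample_id: str,
                    prompt: str,
                    response: str,
                    scorer: Callable[[str, str], float],
                    prompt_tokens: int,
                    completion_tokens: int,
                    usd_per_1k_prompt: float = 0.01,       # assumed pricing
                    usd_per_1k_completion: float = 0.03,   # assumed pricing
                    review_threshold: float = 0.6) -> EvalResult:
    """Score one production response with a custom metric and attach cost."""
    score = scorer(prompt, response)
    cost = (prompt_tokens / 1000) * usd_per_1k_prompt \
         + (completion_tokens / 1000) * usd_per_1k_completion
    return EvalResult(
        sample_id=sample_id,
        score=score,
        cost_usd=cost,
        flagged_for_review=score < review_threshold,  # human-in-the-loop hook
    )


# Example custom metric: does the response include a required disclaimer?
def has_disclaimer(prompt: str, response: str) -> float:
    return 1.0 if "not financial advice" in response.lower() else 0.0


result = evaluate_sample("req-123",
                         "Should I buy this stock?",
                         "This is not financial advice, but...",
                         has_disclaimer,
                         prompt_tokens=42,
                         completion_tokens=120)
print(result)
```

Even a harness this simple shows why the team's script-based approach breaks down: the metrics, cost assumptions, and review queue all have to be maintained by hand, which is exactly the gap they want a mature tool to fill.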
Key takeaways:
- Current evaluation tools for LLMs have significant limitations, particularly in custom use cases and real-time monitoring.
- OpenAI's Evals framework, LangSmith, Weights & Biases, Humanloop, and Braintrust each have specific drawbacks, such as complexity, cost, and limited functionality.
- The team needs a solution that supports real-time evaluation, custom evaluation functions, human-in-the-loop workflows, cost tracking, and integration with existing systems.
- The current workaround involves custom scripts and manual reviews, which are not scalable and miss edge cases, indicating a gap in mature tooling for LLM evaluation.