Session Details
New models launch every week — GPT, Claude, Gemini, and more — but how do you know which one actually works best for your customer support use case? The answer: rigorous evals.
Join the IrisAgent team for a hands-on session on how we build, maintain, and iterate on proprietary eval datasets to benchmark every leading LLM against real customer support scenarios. We'll share how we've tested 7+ models across accuracy, latency, and cost — and iterated through 29 prompt versions — so our customers always get the best-performing AI agent.
What You'll Learn:
- How to design eval datasets that reflect real CX scenarios and edge cases.
- A framework for benchmarking LLMs on accuracy, latency, and cost simultaneously (a minimal code sketch follows this list).
- Why prompt versioning matters — and how to systematically optimize across models.
- How to catch regressions before they reach production.
- Lessons from benchmarking the latest models, including Claude Opus 4.6, o4-mini, Gemini 3 Pro, GPT-5.1, and more.
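For attendees who want to experiment before the session, here is a minimal sketch of what a multi-metric eval harness can look like. The `call_model` wrapper, the sample case, the placeholder cost figure, and the exact-match grading rule are illustrative assumptions, not IrisAgent's production tooling.

```python
# Minimal sketch of a multi-metric eval harness (illustrative only).
# Assumes a hypothetical client wrapper:
#   call_model(model_name, prompt) -> (answer, cost_usd)
import time
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvalCase:
    prompt: str
    expected: str  # reference answer used for a simple exact-match grade

def call_model(model_name: str, prompt: str) -> tuple[str, float]:
    """Stand-in for a real LLM client; returns (answer, cost in USD)."""
    return "refund issued", 0.0004  # placeholder response and placeholder cost

def run_eval(model_name: str, cases: list[EvalCase]) -> dict:
    accuracies, latencies, costs = [], [], []
    for case in cases:
        start = time.perf_counter()
        answer, cost = call_model(model_name, case.prompt)
        latencies.append(time.perf_counter() - start)
        costs.append(cost)
        # Exact match keeps the sketch simple; real CX evals typically use
        # rubric- or LLM-as-judge-based grading instead.
        accuracies.append(1.0 if answer.strip().lower() == case.expected.lower() else 0.0)
    return {
        "model": model_name,
        "accuracy": mean(accuracies),
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
        "avg_cost_usd": mean(costs),
    }

if __name__ == "__main__":
    cases = [EvalCase("Customer asks: where is my refund?", "refund issued")]
    for model in ["model-a", "model-b"]:
        print(run_eval(model, cases))
```

Running the same case set against each candidate model yields one row of accuracy, latency, and cost per model, which is the shape of comparison the session walks through in depth.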