
Building Eval Datasets
for CX AI Agents


Mar 4, 2026 12:00 PM PT

SAVE YOUR SEAT
Session Details
New models launch every week — GPT, Claude, Gemini, and more — but how do you know which one actually works best for your customer support use case? The answer: rigorous evals.

Join the IrisAgent team for a hands-on session on how we build, maintain, and iterate on proprietary eval datasets to benchmark every leading LLM against real customer support scenarios. We'll share how we've tested 7+ models across accuracy, latency, and cost — and iterated through 29 prompt versions — so our customers always get the best-performing AI agent.

What You'll Learn:
  • How to design eval datasets that reflect real CX scenarios and edge cases.
  • A framework for benchmarking LLMs on accuracy, latency, and cost simultaneously (a simplified sketch follows below).
  • Why prompt versioning matters — and how to systematically optimize across models.
  • How to catch regressions before they reach production.
  • Lessons from benchmarking the latest models, including Claude Opus 4.6, o4-mini, Gemini 3 Pro, GPT-5.1, and more.
Whether you're just starting to evaluate LLMs or looking to level up your existing eval practice, this session will give you a practical playbook you can apply immediately.
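To give a flavor of what this looks like in practice, here is a deliberately simplified sketch of an eval harness that scores candidate models on accuracy and latency against a small dataset of support scenarios, then flags regressions against a known-good baseline. It is an illustration only, not IrisAgent's actual tooling: the dataset, the call_model stub, the model names, and the 0.80 baseline are placeholder assumptions.

    """Minimal eval-harness sketch (illustrative only, not IrisAgent's tooling).

    Assumptions: call_model is a stand-in for whatever client you use to query
    an LLM; the dataset, model names, and baseline are all placeholders.
    """
    import time
    from dataclasses import dataclass

    @dataclass
    class EvalCase:
        ticket: str    # customer message, ideally drawn from real support logs
        expected: str  # key fact or phrase the reply must contain

    # Tiny placeholder dataset; a real one would cover intents and edge cases.
    DATASET = [
        EvalCase("How do I reset my password?", "reset link"),
        EvalCase("I was charged twice this month.", "refund"),
    ]

    def call_model(model: str, prompt: str) -> str:
        """Stand-in for a real LLM call; replace with your provider's client."""
        return f"[{model} reply to: {prompt}]"

    def run_eval(model: str) -> dict:
        correct, latencies = 0, []
        for case in DATASET:
            start = time.perf_counter()
            reply = call_model(model, case.ticket)
            latencies.append(time.perf_counter() - start)
            # Naive grading via substring match; real evals often use a rubric
            # or an LLM-as-judge instead.
            if case.expected.lower() in reply.lower():
                correct += 1
        return {
            "model": model,
            "accuracy": correct / len(DATASET),
            "avg_latency_s": sum(latencies) / len(latencies),
        }

    BASELINE_ACCURACY = 0.80  # placeholder: last known-good score for regression checks

    if __name__ == "__main__":
        for model in ["model-a", "model-b"]:  # placeholder model names
            result = run_eval(model)
            flag = "REGRESSION" if result["accuracy"] < BASELINE_ACCURACY else "ok"
            print(result, flag)

In a real setup you would swap call_model for your provider's client, grade with a rubric or an LLM judge rather than substring matching, and track cost per request alongside accuracy and latency.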
Session Details

New models launch every week — but how do you know which one works best for your CX use case? Join us to learn how IrisAgent builds eval datasets to benchmark every leading LLM.

Register for the Live Session

© Copyright Iris Agent Inc. All Rights Reserved