Beating τ-bench retail by 10 percentage points

We beat the best public models on τ-bench retail by 10 percentage points. With three people.

This post is about what we did, what we learned, and why the manual work that got us there is exactly what we’re now automating.

What is τ-bench?

τ-bench is a benchmark for evaluating AI agents on realistic, multi-turn tasks. The retail variant tests an agent’s ability to handle customer service scenarios — processing returns, modifying orders, applying policies correctly, and navigating ambiguous edge cases.

It’s one of the harder benchmarks because it requires genuine reasoning about policy, not just pattern matching. The agent has to understand nuance: when a refund is appropriate, when to escalate, when the customer is technically wrong but should be accommodated anyway.

What we did

Our approach wasn’t a single clever trick. It was a structured optimization process applied relentlessly.

Phase 1: Failure clustering. We ran the benchmark, collected every failure, and manually categorized them into buckets. Not by error type — by root cause. “The agent misunderstood the return window” is different from “the agent applied the return policy correctly but missed an exception.” This distinction matters because they require different fixes.
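To make the root-cause distinction concrete, here is a minimal sketch of the bucketing step. It assumes a human reviewer has already assigned a root-cause label to each failure. The `Failure` record, its field names, and the labels are hypothetical illustrations, not taken from the actual tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    task_id: str     # hypothetical benchmark task identifier
    error_type: str  # surface symptom, e.g. "wrong_refund"
    root_cause: str  # label assigned by a human reviewer (assumed to exist)


def cluster_by_root_cause(failures):
    """Group failed task ids by root cause, largest cluster first."""
    clusters = defaultdict(list)
    for failure in failures:
        clusters[failure.root_cause].append(failure.task_id)
    # sorted() is stable, so equally sized clusters keep first-seen order.
    return sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)


# Two failures share an error type but have different root causes,
# so they land in different buckets and can get different fixes.
example = [
    Failure("t1", "wrong_refund", "misread_return_window"),
    Failure("t2", "wrong_refund", "missed_policy_exception"),
    Failure("t3", "wrong_exchange", "misread_return_window"),
]
for cause, task_ids in cluster_by_root_cause(example):
    print(cause, task_ids)
```

Sorting by cluster size is a design choice in this sketch: it surfaces the root causes that account for the most failures first.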

Phase 2: Hypothesis generation. For each failure cluster, we formed hypotheses about what was going wrong at the policy level. Not “the prompt needs to be longer” but “the agent doesn’t distinguish between refund eligibility and refund preference.” These hypotheses were specific and testable.

Phase 3: Structured iteration. We didn’t just tweak prompts randomly. We applied changes methodically — reflection prompts for policy understanding, targeted examples for edge cases, mini-batch validation to catch regressions. Each change was tested against the specific failure cluster it targeted and against all previous passing cases.

Phase 4: Regression prevention. Every improvement was validated against the full suite. When a fix for one failure cluster broke something else, we didn’t just revert — we understood why and found a solution that addressed both.
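The acceptance rule described in Phases 3 and 4 can be sketched as a simple gate: a change counts only if it fixes something in the cluster it targets and breaks nothing that previously passed. This is an illustration under stated assumptions, not the actual harness. The function name, the result format (a dict mapping task id to pass/fail), and the strict "zero regressions" rule are all assumptions.

```python
def evaluate_change(before, after, target_cluster):
    """Compare two benchmark runs for one candidate change.

    before, after: dict mapping task_id -> bool (True means the task passed).
    target_cluster: set of task ids the change was meant to fix.
    A task missing from `after` is treated as a failure.
    """
    fixed = sorted(
        t for t in target_cluster
        if after.get(t, False) and not before.get(t, False)
    )
    regressions = sorted(
        t for t, passed in before.items()
        if passed and not after.get(t, False)
    )
    return {
        "fixed": fixed,
        "regressions": regressions,
        # Assumed policy: accept only if something was fixed and nothing broke.
        "accept": bool(fixed) and not regressions,
    }


before = {"a": True, "b": False, "c": False, "d": True}
after = {"a": True, "b": True, "c": False, "d": False}
print(evaluate_change(before, after, {"b", "c"}))
```

In this example the change fixes task `b` but breaks task `d`, so the gate rejects it. The list of regressions is the starting point for working out why the fix interfered, rather than simply reverting.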

What we learned

The most important finding wasn’t about τ-bench at all. It was this: the optimization process itself has a natural curriculum.

Early in the process, broad changes have outsized impact. Fixing a fundamental policy misunderstanding might resolve 15 failure cases at once. Later, improvements become more surgical — a specific edge case here, a subtle reasoning error there. The returns diminish, but the process is the same.

This works like a learning-rate schedule for agent optimization: large steps first, then smaller ones, with checkpoints, gates, and emergency stops in between.
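One way to picture the checkpoints, gates, and emergency stops is the toy loop below. It is a sketch under assumptions, not a description of the real process: the function name, the `max_drop` threshold, and every score in the example are made up for illustration and are not benchmark results.

```python
def run_schedule(candidates, baseline, max_drop=0.05):
    """Walk through candidate changes in planned order (broad to surgical).

    candidates: list of (name, score_after_change) pairs.
    baseline: score before any change.
    max_drop: hypothetical threshold; a score this far below the current
        checkpoint triggers an emergency stop.
    Returns (checkpoint_score, accepted_names, status).
    """
    checkpoint = baseline
    accepted = []
    for name, score in candidates:
        # Emergency stop: the change did serious damage, halt the run.
        if score < checkpoint - max_drop:
            return checkpoint, accepted, f"stopped at {name}"
        # Gate: keep the change only if it beats the current checkpoint.
        if score > checkpoint:
            checkpoint = score
            accepted.append(name)
    return checkpoint, accepted, "completed"


# Illustrative numbers only: one broad fix, two surgical ones, one bad change.
plan = [
    ("broad_policy_fix", 0.58),
    ("edge_case_a", 0.57),
    ("edge_case_b", 0.60),
    ("risky_rewrite", 0.40),
]
print(run_schedule(plan, baseline=0.50))
```

Here the broad fix produces the largest jump, later changes move the score by small amounts or are rejected at the gate, and the last change trips the emergency stop while the checkpoint is preserved.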

Every team that ships agents discovers this pattern independently. They all build some version of it by hand. They all wish it were automated.

Why this matters

Three people, working manually through this process, beat models backed by teams of dozens. Not because we’re smarter — because the process is that powerful when applied systematically.

The bottleneck was never the model. It was the speed at which humans can iterate through the optimization loop. Read traces. Form hypotheses. Design experiments. Run them. Check for regressions. Repeat.

This is precisely the work we’re now building 4242.ai to automate. The manual workflow that got us to state-of-the-art on τ-bench retail is the blueprint for the system.

What’s next

We’re taking what we learned — the curriculum structure, the failure clustering approach, the regression-aware iteration — and encoding it into autonomous agents that can run this process continuously, at machine speed, across arbitrary domains.

The τ-bench result was proof of concept. The system is the product.