τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Abstract

As recommender systems move toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluation, which is subjective, costly, and inconsistent. We present τ-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and adds a reveal-tagged elicitation (RTE) mechanism to control how task constraints surface during dialogue. By testing agents against structured catalog predicates and adopting the pass^k reliability metric, τ-Rec provides a systematic test of consistent reasoning. We evaluate nine configurations across five model families, covering GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, and GPT-5 mini. The results reveal a steep reliability cliff: even the best model achieves only about 57% at pass^1 and about 38% at pass^4, highlighting a critical gap in the deployment of current conversational agents. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
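
The abstract does not spell out how pass^k is computed. The sketch below assumes τ-Rec follows the definition commonly used for this metric: the probability that k trials of the same task all succeed, averaged over tasks. It is estimated per task as C(c, k) / C(n, k) from c successes in n recorded trials. The function names and the per-task success counts are hypothetical and chosen only for illustration; they are not taken from the τ-Rec repository or its results.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed.

    Assumed estimator: C(c, k) / C(n, k), where c is the number of
    successes observed in n recorded trials of the task.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    # comb(c, k) is 0 when c < k, so a task with fewer than k successes
    # contributes nothing.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    if not successes_per_task:
        raise ValueError("at least one task is required")
    total = sum(pass_hat_k(num_trials, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Illustrative, made-up data: four tasks, four trials each.
example_successes = [4, 2, 0, 3]
for k in (1, 2, 4):
    print(k, benchmark_pass_hat_k(example_successes, num_trials=4, k=k))
```

Under this definition, only tasks solved in every sampled trial count fully toward pass^k. The score can therefore only stay flat or fall as k grows, which is the kind of drop the abstract calls a reliability cliff.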