τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems
June 8, 2026
Authors: Bharath Sivaram Narasimhan, Karthik R Narasimhan
cs.AI
Abstract
As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluation, which introduces subjectivity, high cost, and inconsistency. We present τ-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with two components: verifiable rewards, and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By checking agent outputs against structured catalog predicates and scoring them with a pass^k reliability metric, τ-Rec provides a systematic test of consistent reasoning. Our evaluation of nine configurations across five model families (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, and GPT-5 mini) reveals a steep reliability cliff: even the best model achieves only about 57% at pass^1 and about 38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
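The abstract does not spell out how pass^k is computed. The sketch below assumes it follows the definition commonly used for this metric: the probability that all k independent trials of the same task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and the success counts are hypothetical and are not taken from the τ-Rec repository.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate, for one task, the probability that k i.i.d. trials all succeed.

    Assumed estimator (not confirmed by the abstract): C(c, k) / C(n, k),
    where c is the number of successful trials out of n.
    """
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must satisfy 0 <= successes <= trials")
    # math.comb returns 0 when k > successes, so such a task scores 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task_successes: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    if not per_task_successes:
        raise ValueError("at least one task is required")
    scores = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(scores) / len(scores)


# Hypothetical success counts for four tasks, each run 4 times.
example_counts = [4, 2, 0, 3]
pass_1 = benchmark_pass_hat_k(example_counts, trials=4, k=1)
pass_4 = benchmark_pass_hat_k(example_counts, trials=4, k=4)
```

With these made-up counts, pass^1 is 0.5625 and pass^4 is 0.25: only the task that succeeded in all four trials contributes at k = 4. Under this assumed definition, that is the kind of drop the abstract calls a reliability cliff.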