τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Abstract

As recommender systems move toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluation, which introduces subjectivity, high cost, and inconsistency. We present τ-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and introduces a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, τ-Rec provides a systematic test of consistent reasoning. Our evaluation of nine configurations across five model families, including GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, and GPT-5 mini, reveals a steep reliability cliff: even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
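The abstract does not spell out how pass^k is computed. As a minimal sketch, assume it follows the definition commonly used for pass^k in earlier agent benchmarks: the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function name `pass_hat_k` and the toy data below are hypothetical illustrations, not taken from the τ-Rec code or results.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k, assuming the C(c, k) / C(n, k) estimator.

    successes_per_task: one count c per task, the number of successful
                        trials out of num_trials for that task.
    num_trials:         n, the number of trials run for every task.
    k:                  number of trials that must all succeed.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("at least one task is required")

    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= num_trials:
            raise ValueError("each success count must lie in [0, num_trials]")
        # comb(c, k) is 0 when c < k, so a task with fewer than k
        # successes contributes nothing to pass^k.
        total += comb(c, k) / comb(num_trials, k)
    return total / len(successes_per_task)


if __name__ == "__main__":
    # Toy data, made up for illustration: 5 tasks, 4 trials each.
    toy_successes = [4, 4, 3, 2, 0]
    for k in range(1, 5):
        print(f"pass^{k} = {pass_hat_k(toy_successes, 4, k):.3f}")
```

On this toy data the estimate falls from 0.650 at pass^1 to 0.400 at pass^4. Under this estimator only tasks solved in every trial contribute at k = n, which is the kind of gap between single-trial accuracy and repeated-trial reliability that the abstract calls a reliability cliff.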