τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Abstract

As recommender systems move toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluation, which is subjective, costly, and inconsistent. We present τ-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and introduces a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and scoring them with a pass^k reliability metric, τ-Rec provides a systematic test of consistent reasoning. We evaluate nine configurations spanning five model families (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, and GPT-5 mini) and observe a steep reliability cliff: even the best model achieves only ~57% at pass^1 and ~38% at pass^4, exposing a critical gap for deploying current conversational agents. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
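
The abstract does not spell out how pass^k is computed. As a minimal sketch, assuming it follows the definition popularized by τ-bench (the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks), the metric could be computed as below. The function names and the example success counts are hypothetical and are not taken from the τ-Rec repository.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate, for one task, the probability that k trials all succeed.

    Assumed estimator: C(c, k) / C(n, k), where c is the number of
    successful trials out of n. This is an assumption about the metric,
    not code from the benchmark itself.
    """
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    if not 1 <= k <= num_trials:
        raise ValueError("k must lie in [1, num_trials]")
    # math.comb returns 0 when k > num_successes, so the estimate is 0
    # whenever a task has fewer than k successful trials.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    if not successes_per_task:
        raise ValueError("at least one task is required")
    total = sum(pass_hat_k(num_trials, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical data: four tasks, four trials each, with 4, 3, 0 and 2 successes.
example_successes = [4, 3, 0, 2]
pass_1 = benchmark_pass_hat_k(example_successes, num_trials=4, k=1)
pass_4 = benchmark_pass_hat_k(example_successes, num_trials=4, k=4)
```

Under this estimator, pass^1 reduces to the mean per-trial success rate, while pass^4 credits only tasks solved in every one of the four trials. That difference is why a score can fall sharply as k grows, which is the kind of reliability cliff the abstract reports.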