
UserBench: An Interactive Gym Environment for User-Centric Agents

July 29, 2025
Authors: Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI

Abstract

Large Language Model (LLM)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.
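The abstract frames UserBench as a gym-style environment in which a simulated user begins with an underspecified goal and reveals preferences only when prompted, and the agent is scored on how well its final answer aligns with those preferences. Below is a minimal, self-contained Python sketch of that interaction pattern; every name in it (SimulatedUser, ToyUserEnv, the flight-booking preferences) is a hypothetical stand-in for illustration and is not the actual UserBench API or data.

```python
# Toy sketch of a UserBench-style interaction loop (illustrative only; not the
# real UserBench API). A simulated user starts with a vague goal, reveals one
# hidden preference per clarifying question, and the final answer is scored by
# the fraction of hidden preferences it satisfies.

from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    """Simulated user with a vague goal and hidden preferences revealed incrementally."""
    goal: str = "Book me a flight to Tokyo."
    hidden_preferences: list = field(default_factory=lambda: [
        "depart after 10am", "window seat", "budget under $900",
    ])
    revealed: list = field(default_factory=list)

    def answer_clarification(self, question: str) -> str:
        # Reveal the next unrevealed preference, if any remain.
        remaining = [p for p in self.hidden_preferences if p not in self.revealed]
        if remaining:
            self.revealed.append(remaining[0])
            return f"Actually, I'd prefer: {remaining[0]}."
        return "No further preferences."


class ToyUserEnv:
    """Gym-like wrapper: reset() returns the vague goal, step() handles agent actions."""

    def __init__(self):
        self.user = SimulatedUser()

    def reset(self) -> str:
        self.user = SimulatedUser()
        return self.user.goal

    def step(self, action: dict):
        # action = {"type": "clarify", "question": ...} or {"type": "answer", "content": ...}
        if action["type"] == "clarify":
            obs = self.user.answer_clarification(action["question"])
            return obs, 0.0, False  # observation, reward, done
        # Terminal answer: reward = fraction of all hidden preferences it satisfies.
        satisfied = sum(p in action["content"] for p in self.user.hidden_preferences)
        reward = satisfied / len(self.user.hidden_preferences)
        return "episode finished", reward, True


if __name__ == "__main__":
    env = ToyUserEnv()
    obs = env.reset()
    # A naive agent: ask two clarifying questions, then commit to an answer.
    for _ in range(2):
        obs, _, _ = env.step({"type": "clarify", "question": "Any constraints I should know?"})
    final = "Booked a window seat on a flight that will depart after 10am."
    _, reward, done = env.step({"type": "answer", "content": final})
    print(f"preference alignment score: {reward:.2f}")  # 2 of 3 preferences satisfied
```

In this toy version the reward is simply the share of hidden preferences the final answer covers, which loosely mirrors the paper's alignment-style metrics (e.g., the finding that even strong models uncover fewer than 30% of user preferences through active interaction); the real benchmark's simulation and scoring are more involved.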