

UserBench: An Interactive Gym Environment for User-Centric Agents

July 29, 2025
作者: Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI

Abstract

Large Language Model (LLM)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.
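To make the setup concrete, here is a minimal, purely illustrative sketch of the kind of simulated-user loop the abstract describes: a user holds hidden preferences that are revealed one at a time in response to clarifying questions, and the agent's final answer is scored by how many of *all* hidden preferences it satisfies. The class names, the one-preference-per-question disclosure rule, and the scoring function are all assumptions for illustration, not the actual UserBench API.

```python
from dataclasses import dataclass, field

@dataclass
class SimulatedUser:
    """Toy simulated user with hidden preferences, revealed incrementally."""
    preferences: list          # hidden constraints, e.g. ["window seat", ...]
    revealed: list = field(default_factory=list)

    def answer(self, clarifying_question: str) -> str:
        # Disclose the next hidden preference, mimicking incremental reveal.
        for pref in self.preferences:
            if pref not in self.revealed:
                self.revealed.append(pref)
                return f"Actually, I'd prefer {pref}."
        return "No other preferences."

def alignment_score(chosen_options: set, user: SimulatedUser) -> float:
    """Fraction of ALL hidden preferences the final answer satisfies."""
    if not user.preferences:
        return 1.0
    return len(chosen_options & set(user.preferences)) / len(user.preferences)

# Hypothetical episode: the agent asks two questions, then commits to an answer.
user = SimulatedUser(preferences=["window seat", "morning flight", "one stop max"])
user.answer("Any seating preference?")
user.answer("When do you want to depart?")
final_answer = {"window seat", "morning flight"}
print(alignment_score(final_answer, user))  # 2 of 3 preferences satisfied
```

The gap the paper measures falls out naturally from this framing: an agent that stops asking questions too early can still "complete" the task while satisfying only a fraction of the user's actual intents.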