UserBench: An Interactive Gym Environment for User-Centric Agents

Abstract

Agents based on large language models (LLMs) have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal their preferences incrementally, requiring agents to proactively clarify intent and use tools to make grounded decisions. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.