UserBench: An Interactive Gym Environment for User-Centric Agents

Abstract

Large language model (LLM)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal their preferences incrementally, requiring agents to proactively clarify intent and use tools to make grounded decisions. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.
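The two headline figures above are ratios over simulated conversations, and a small sketch can make the difference between them concrete. The code below is an illustration only, not UserBench's scoring code: the `Episode` record, its field names, and the choice to pool preferences across episodes (rather than average per episode) are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Episode:
    """One simulated conversation (hypothetical record format, not UserBench's)."""
    preferences: set[str]                               # every preference the simulated user holds
    elicited: set[str] = field(default_factory=set)     # preferences surfaced by the agent's own questions
    satisfied: set[str] = field(default_factory=set)    # preferences met by the agent's final answer


def full_alignment_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes whose final answer satisfies every user preference."""
    if not episodes:
        return 0.0
    hits = sum(1 for e in episodes if e.preferences <= e.satisfied)
    return hits / len(episodes)


def elicitation_rate(episodes: list[Episode]) -> float:
    """Fraction of all user preferences uncovered through the agent's questions.

    Assumption: preferences are pooled over all episodes before dividing.
    """
    total = sum(len(e.preferences) for e in episodes)
    if total == 0:
        return 0.0
    found = sum(len(e.elicited & e.preferences) for e in episodes)
    return found / total


# Toy data: the first answer meets both preferences although only one was
# actively elicited; the second answer misses a preference entirely.
episodes = [
    Episode(preferences={"direct flight", "aisle seat"},
            elicited={"direct flight"},
            satisfied={"direct flight", "aisle seat"}),
    Episode(preferences={"quiet hotel", "free breakfast"},
            elicited=set(),
            satisfied={"quiet hotel"}),
]

print(full_alignment_rate(episodes))  # 0.5
print(elicitation_rate(episodes))     # 0.25
```

In this toy data the two numbers diverge: an answer can happen to satisfy a preference the agent never asked about, so full alignment and active elicitation are tracked separately.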