VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.