VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
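
To make the idea of an extensible memory interface concrete, the following is a minimal sketch of how such an interface could be structured so that different memory architectures are swapped in behind the same two calls. Every name here (`Interaction`, `Memory`, `FullHistoryMemory`, `KeywordMemory`, `replay`) is a hypothetical illustration and is not taken from the VitaBench 2.0 codebase; the keyword-overlap retrieval is a deliberately simple stand-in for whatever retrieval a real memory system would use.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One fragment of a user's history (hypothetical schema)."""
    timestamp: int
    text: str


class Memory(ABC):
    """Hypothetical pluggable memory: a harness would only call write() and read()."""

    @abstractmethod
    def write(self, interaction: Interaction) -> None:
        ...

    @abstractmethod
    def read(self, query: str) -> list[Interaction]:
        ...


class FullHistoryMemory(Memory):
    """Baseline that stores everything and returns the whole history."""

    def __init__(self) -> None:
        self._items: list[Interaction] = []

    def write(self, interaction: Interaction) -> None:
        self._items.append(interaction)

    def read(self, query: str) -> list[Interaction]:
        return list(self._items)


class KeywordMemory(FullHistoryMemory):
    """Returns only stored interactions that share at least one word with the query."""

    def read(self, query: str) -> list[Interaction]:
        words = set(query.lower().split())
        return [i for i in self._items if words & set(i.text.lower().split())]


def replay(interactions: list[Interaction], memory: Memory, query: str) -> list[Interaction]:
    """Feed interactions to the memory in temporal order, then query it."""
    for item in sorted(interactions, key=lambda i: i.timestamp):
        memory.write(item)
    return memory.read(query)
```

Because both classes implement the same `write`/`read` contract, the same interaction history and the same query can be replayed against each one, which is the kind of controlled comparison the abstract describes. Preference updates over time (for example, a later interaction overriding an earlier one) are not modeled in this sketch.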