VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users on real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated: user intent is often reflected in fragmented daily interactions, so recovering it requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenge of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, whose preferences are embedded in fragmented and heterogeneous interactions. Completing a task successfully requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, leaving a substantial gap between current capabilities and practical requirements. Further analysis identifies the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
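
To make the idea of an extensible memory interface concrete, the following is a minimal sketch of how two memory architectures could be compared under an identical interaction sequence. It is an illustration only: the names `Interaction`, `Memory`, `RecencyMemory`, `KeywordMemory`, and `replay` are hypothetical and are not taken from VitaBench 2.0, whose actual interface is not described in this abstract.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Interaction:
    """One fragment of a user's history (hypothetical schema)."""
    timestamp: int
    text: str


class Memory(ABC):
    """Hypothetical pluggable memory; not the benchmark's real API."""

    @abstractmethod
    def update(self, interaction: Interaction) -> None:
        """Store or summarize a new interaction."""

    @abstractmethod
    def retrieve(self, query: str, k: int = 3) -> list:
        """Return up to k stored texts judged relevant to the query."""


class RecencyMemory(Memory):
    """Baseline: ignores the query and returns the k latest interactions."""

    def __init__(self) -> None:
        self.log = []

    def update(self, interaction: Interaction) -> None:
        self.log.append(interaction)

    def retrieve(self, query: str, k: int = 3) -> list:
        return [item.text for item in self.log[-k:]]


class KeywordMemory(Memory):
    """Baseline: ranks stored interactions by word overlap with the query."""

    def __init__(self) -> None:
        self.log = []

    def update(self, interaction: Interaction) -> None:
        self.log.append(interaction)

    def retrieve(self, query: str, k: int = 3) -> list:
        query_words = set(query.lower().split())

        def overlap(item: Interaction) -> int:
            return len(query_words & set(item.text.lower().split()))

        ranked = sorted(self.log, key=overlap, reverse=True)
        return [item.text for item in ranked[:k]]


def replay(memory: Memory, history: list, query: str, k: int = 3) -> list:
    """Feed interactions in temporal order, then query the memory."""
    for interaction in sorted(history, key=lambda item: item.timestamp):
        memory.update(interaction)
    return memory.retrieve(query, k)
```

Because both classes implement the same two methods, the replay loop stays fixed while the memory architecture varies, which is the kind of controlled comparison the abstract refers to. A real implementation would likely add summarization or embedding-based retrieval behind the same interface.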