VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
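
The abstract does not specify what the extensible memory interface looks like. As a rough illustration only, the Python sketch below shows one way such an interface could allow controlled comparison: every memory architecture implements the same two calls, and the same temporally ordered history is replayed into each. All names here (`Interaction`, `Memory`, `FullHistoryMemory`, `RecentWindowMemory`, `replay`) are hypothetical and are not taken from VitaBench 2.0.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One fragment of a user's history (hypothetical schema)."""
    step: int  # position in the user's temporally ordered sequence
    text: str  # raw content that may carry a preference


class Memory(ABC):
    """Hypothetical common interface shared by all memory architectures."""

    @abstractmethod
    def write(self, interaction: Interaction) -> None:
        """Store one interaction."""

    @abstractmethod
    def read(self, query: str) -> list[Interaction]:
        """Return the interactions the agent may condition on."""


class FullHistoryMemory(Memory):
    """Baseline: keeps and returns every interaction."""

    def __init__(self) -> None:
        self._log: list[Interaction] = []

    def write(self, interaction: Interaction) -> None:
        self._log.append(interaction)

    def read(self, query: str) -> list[Interaction]:
        return list(self._log)


class RecentWindowMemory(Memory):
    """Alternative: returns only the most recent `window` interactions."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self._window = window
        self._log: list[Interaction] = []

    def write(self, interaction: Interaction) -> None:
        self._log.append(interaction)

    def read(self, query: str) -> list[Interaction]:
        return self._log[-self._window:]


def replay(memory: Memory, history: list[Interaction], query: str) -> list[Interaction]:
    """Feed the same history, in temporal order, into any memory, then query it."""
    for interaction in sorted(history, key=lambda i: i.step):
        memory.write(interaction)
    return memory.read(query)


# Usage: the history is held fixed, so only the memory architecture varies.
history = [
    Interaction(1, "ordered a beef burger"),
    Interaction(2, "mentioned a new vegetarian diet"),
]
full_view = replay(FullHistoryMemory(), history, "suggest dinner")
recent_view = replay(RecentWindowMemory(window=1), history, "suggest dinner")
```

Because the history and query are identical across runs, any difference in downstream agent behavior can be attributed to the memory implementation, which is the kind of controlled comparison the abstract describes. The toy history also hints at why preference updating matters: a later interaction can supersede an earlier one.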