π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

Abstract

The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce π-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, π-Bench evaluates agents' ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show that (1) proactive assistance remains challenging, (2) task completion and proactivity are clearly distinct capabilities, and (3) prior interaction is valuable for resolving hidden intents proactively in later tasks.
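
To make the distinction between task completion and proactivity concrete, the following is a minimal sketch of how the two could be scored separately. The abstract does not specify the benchmark's data format or metrics, so every name here (`Task`, `Outcome`, `proactivity`, `summarize`, and all field names) is a hypothetical illustration, and the proactivity measure (share of hidden intents resolved before the user states them) is an assumption, not the paper's definition.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical record for one multi-turn task (field names are illustrative)."""
    task_id: str
    persona: str                      # one of the domain-specific user personas
    session: int                      # session index, for cross-session continuity
    hidden_intents: set[str]          # needs the user has not stated explicitly
    depends_on: list[str] = field(default_factory=list)  # earlier task ids


@dataclass
class Outcome:
    """Hypothetical summary of one agent run on a task."""
    completed: bool                   # did the agent finish the stated request?
    resolved_unprompted: set[str]     # hidden intents handled before being stated


def proactivity(task: Task, outcome: Outcome) -> float:
    """Assumed metric: fraction of hidden intents resolved without prompting."""
    if not task.hidden_intents:
        return 1.0
    hits = task.hidden_intents & outcome.resolved_unprompted
    return len(hits) / len(task.hidden_intents)


def summarize(runs: list[tuple[Task, Outcome]]) -> tuple[float, float]:
    """Return (mean completion rate, mean proactivity) over a trajectory."""
    n = len(runs)
    completion = sum(outcome.completed for _, outcome in runs) / n
    proactive = sum(proactivity(task, outcome) for task, outcome in runs) / n
    return completion, proactive


# Toy trajectory: the second task depends on the first and occurs in a later session.
t1 = Task("t1", "researcher", 1, {"vegetarian", "budget"})
o1 = Outcome(completed=True, resolved_unprompted={"budget"})
t2 = Task("t2", "researcher", 2, {"vegetarian"}, depends_on=["t1"])
o2 = Outcome(completed=True, resolved_unprompted={"vegetarian"})

completion_rate, proactivity_rate = summarize([(t1, o1), (t2, o2)])
```

In this toy trajectory both tasks are completed, yet the proactivity score is lower because one hidden intent in the first task was missed. Under these assumptions, that is the kind of gap a completion-only metric would not reveal.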