π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

Abstract

The rise of personal assistant agents such as OpenClaw highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce π-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, π-Bench evaluates agents' ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show that (1) proactive assistance remains challenging, (2) task completion and proactivity are clearly distinct, and (3) prior interaction is valuable for proactively resolving hidden intents in later tasks.