Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

May 25, 2026
Authors: Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan, Lue Fan, Sanyuan Zhao, Dandan Tu
cs.AI

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, which limits context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over these rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, which requires agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below results on prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves base-model performance by 23.7%, demonstrating the utility of scalable data infrastructure.
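
For readers unfamiliar with the pass@1 metric quoted above, the following is a minimal sketch of how such a score is commonly computed. It assumes the widely used unbiased pass@k estimator and a per-task list of boolean attempt outcomes; the abstract does not describe the benchmark's actual evaluation harness, so the function names, task names, and data layout here are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of sampled attempts, c: number of successful attempts,
    k: attempt budget being scored (k=1 gives pass@1).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average pass@1 over tasks; `results` maps a task id to attempt outcomes.

    The task ids and this layout are assumptions made for illustration.
    """
    scores = [pass_at_k(len(r), sum(r), 1) for r in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes for two tasks, two attempts each.
example = {"task_a": [True, False], "task_b": [False, False]}
score = benchmark_pass_at_1(example)
```

With k = 1 the estimator reduces to the fraction of successful attempts per task, averaged over tasks, so a reported 34.5% pass@1 means roughly one task in three is solved on a single attempt.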