Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

May 25, 2026
Authors: Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan, Lue Fan, Sanyuan Zhao, Dandan Tu
cs.AI

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, which limits context-sensitive reasoning and effective assistance. Existing benchmarks similarly expose only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states with realistic noise, including irrelevant events and conflicting signals. Agents must reason over these rich contextual environments while remaining robust to such noise. The expanded scope also enables the evaluation of proactive assistance, which requires agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below results on prior benchmarks, underscoring the gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility as scalable data infrastructure.