Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to the User's Digital World

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, which limits context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over these rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, in which agents must anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially lower than on prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating the utility of scalable data infrastructure.
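For readers unfamiliar with the pass@1 metric cited above, the following is a minimal sketch of how such a score is commonly computed, using the standard unbiased pass@k estimator. The abstract does not state how Claw-Anything aggregates attempts, so the function names, the task identifiers, and the per-task averaging are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them successful."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n attempts contains no successful attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean pass@1 over tasks (hypothetical aggregation, not the paper's code).

    `results` maps a task id to the success flags of its attempts.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_at_k(len(a), sum(a), 1) for a in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes for three tasks (illustrative data, not from the paper).
example = {
    "task_a": [True, False],
    "task_b": [False, False],
    "task_c": [True, True],
}
score = benchmark_pass_at_1(example)
```

With k = 1 the estimator reduces to the fraction of successful attempts per task, so a single attempt per task gives the plain success rate over the benchmark.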