Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to the User's Digital World

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, which limits context-sensitive reasoning and effective assistance. Existing benchmarks likewise expose only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over these rich contextual environments while remaining robust to such noise. The expanded scope also enables the evaluation of proactive assistance, which requires agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially lower than its performance on prior benchmarks, underscoring the gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model's performance by 23.7%, demonstrating the utility of scalable data infrastructure.