ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Abstract

Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.
