A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

Abstract

Clawdbot is a self-hosted, tool-using personal AI agent whose broad action space spans local execution and web-mediated workflows. This breadth heightens safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk dimensions. Our test suite samples and lightly adapts scenarios from prior agent-safety benchmarks (including ATBench and LPS-Bench) and supplements them with hand-designed cases tailored to Clawdbot's tool surface. We log complete interaction trajectories (messages, actions, and tool-call arguments and outputs) and assess safety using both an automated trajectory judge (AgentDoG-Qwen3-4B) and human review. Across 34 canonical cases, we find a non-uniform safety profile: performance is generally consistent on reliability-focused tasks, while most failures arise under underspecified intent, open-ended goals, or benign-seeming jailbreak prompts, where minor misinterpretations can escalate into higher-impact tool actions. We supplement the aggregate results with representative case studies, summarize what these cases have in common, and analyze the security vulnerabilities and typical failure modes that Clawdbot is prone to trigger in practice.
