LLMエージェントのコールドスタート安全性ギャップ

要旨

ツール呼び出しLLMエージェントは、会話全体を通じて常に同じ安全性を保っているのでしょうか？実際はそうではありません。エージェントはセッションの開始直後に最も脆弱であり、通常のエージェントタスクを数回実行した後には大幅に安全性が向上します。この現象を「コールドスタート安全ギャップ（cold-start safety gap）」と呼びます。この現象を体系的に研究するために、我々は「エージェントの安全性の深さ（SODA）」ベンチマークを導入します。これは、エージェントが安全上の脅威に遭遇するまでに実行する通常のエージェントタスクの数を制御し、最大20件の先行タスクをサポートします。4つのファミリーから7つのモデルを評価した結果、先行する通常エージェントタスクの数がゼロから20に増えるにつれて、安全性は9～52%向上しました。表現分析により、モデルの隠れ状態が先行タスクの増加に伴い、安全性に配慮した領域へと徐々にシフトすることが確認されました。先行する会話のどの部分が最も重要かを体系的に調査した結果、通常のエージェントタスク自体が安全性の主な要因である一方、エージェント自身の過去の応答は安全性への影響は小さいものの、後のユーティリティを維持するために不可欠であることがわかりました。この結論は、オープンソースの安全性ベンチマーク（AgentHarm、Agent Safety Bench）およびユーティリティベンチマーク（BFCL、API-Bank）での評価によってさらに裏付けられ、展開前に通常のエージェントタスクでエージェントをウォームアップすることで安全性が向上し、全機能が維持されることが確認されました。これらの知見に基づき、我々は簡単な展開戦略を提案します。すなわち、エージェントを安全性が重要な要求にさらす前に、いくつかの通常のエージェントタスクを実行させることで、コールドスタート安全ギャップを緩和できます。コードはhttps://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap で入手可能です。

English

Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9--52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety-aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open-source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API-Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap. Our code is available at https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap