SABER: Benchmarking the Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract

Large language models are increasingly deployed as coding agents, shifting the focus of safety from individual responses to sequences of actions. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving effects on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety. SABER places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. In our evaluation, even the best-performing model shows a harmful safety-violation rate (HSR) above 54%, suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.