SABER: Benchmarking the Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract

Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety that places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model has a harmful safety-violation rate (HSR) above 54%, suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.