SABER：在有状态项目工作区中对LLM编码代理的操作安全性进行基准测试

摘要

大型语言模型正越来越多地被部署为编码代理，从而使安全性关注点从单个响应转向动作序列。然而，现有基准主要评估模型是否拒绝不安全提示，极少检验模型对带状态工作空间的实际影响。为此，我们提出SABER基准，这是一种面向环境感知的操作安全性评估框架，它将模型置于真实的代理风格项目中，通过一系列动作后的最终环境状态来评估安全性。除了二元的安全违规报告外，SABER还按原因对违规进行分类，从而能够分析不同模型的特定安全特征。我们的评估表明，即使性能最佳的模型，其有害安全违规率（HSR）也超过54%，这表明当前的模型对齐策略仍不足以应对真实项目环境。SABER还揭示了不同模型间截然不同的安全特征。本基准已公开于https://github.com/sssr-lab/saber。

English

Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety that places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model has more than a 54% harmful safety-violation rate (HSR), suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.