SABER: Benchmarking the Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract

Large language models are increasingly deployed as coding agents, which shifts the unit of safety evaluation from individual responses to sequences of actions. Existing benchmarks, however, primarily assess whether a model refuses unsafe prompts, leaving its effects on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety that places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model exhibits a harmful safety-violation rate (HSR) above 54%, suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.