NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

Abstract

We introduce NESSiE, the NEceSsary SafEty benchmark for large language models (LLMs). Using minimal test cases for information and access security, NESSiE reveals safety-relevant failures that should not exist given the low complexity of the tasks. NESSiE is intended as a lightweight, easy-to-use sanity check for language model safety; as such, it is not sufficient to guarantee safety in general, but we argue that passing it is necessary for any deployment. Even state-of-the-art LLMs, however, do not reach 100% on NESSiE and thus fail our necessary condition for language model safety, even in the absence of adversarial attacks. Our Safe & Helpful (SH) metric allows direct comparison of the two requirements and shows that models are biased toward being helpful rather than safe. We further find that model performance degrades when reasoning is disabled for some models and, especially, in the presence of a benign distraction context. Overall, our results underscore the critical risks of deploying such models as autonomous agents in the wild. We make the dataset, package, and plotting code publicly available.
