Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering

Abstract

As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical, especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, which evaluates LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct attacks and as more challenging indirect attacks that seek prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.
