確保安全！在問答系統中對抗間接攻擊的大型語言模型上下文安全策略保持基準測試

摘要

随着大型语言模型（LLMs）在企业与政府等敏感领域中的部署日益增多，确保其在上下文中遵循用户定义的安全策略变得至关重要——尤其是在信息保密方面。尽管以往的LLM研究主要关注一般安全性和社会敏感数据，但针对上下文安全防护的大规模基准测试仍显不足。为此，我们引入了一个新颖的大规模基准数据集——CoPriva，用于评估LLM在问答任务中遵守上下文保密政策的情况。该数据集源自现实情境，包含明确的政策及查询设计，这些查询既包括直接攻击，也包括旨在获取禁止信息的具有挑战性的间接攻击。我们对10个LLM进行了基准测试，揭示了一个显著漏洞：许多模型违反用户定义的政策，泄露敏感信息。这一失败在应对间接攻击时尤为严重，凸显了当前LLM在敏感应用安全对齐方面存在的关键差距。我们的分析表明，尽管模型通常能够识别查询的正确答案，但在生成过程中融入政策约束方面却面临困难。相比之下，当被明确提示时，它们展现出了一定程度的输出修正能力。这些发现强调了开发更为稳健的方法以确保上下文安全的迫切需求。

English

As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical-especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.