确保安全！在问答场景下针对间接攻击的大语言模型上下文安全策略保持基准测试

摘要

随着大型语言模型（LLMs）在企业和政府等敏感领域的应用日益增多，确保其在上下文中遵循用户定义的安全策略变得至关重要，尤其是在信息保密方面。尽管以往的LLM研究主要集中在通用安全性和社会敏感数据上，但针对上下文安全防护的大规模基准测试仍显不足。为此，我们引入了一个新颖的大规模基准数据集——CoPriva，用于评估LLM在问答场景中对上下文保密策略的遵循情况。该数据集源自现实情境，包含明确的策略和查询，这些查询被设计为直接和具有挑战性的间接攻击，旨在获取被禁止的信息。我们在该基准上评估了10个LLM，揭示了一个显著漏洞：许多模型违反用户定义的策略，泄露敏感信息。这一失败在应对间接攻击时尤为严重，凸显了当前LLM在敏感应用中的安全对齐存在关键缺口。我们的分析表明，虽然模型通常能够识别查询的正确答案，但在生成过程中难以融入策略约束。相比之下，当明确提示时，它们展现出部分修正输出的能力。我们的发现强调了迫切需要更强大的方法来保障上下文安全。

English

As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical-especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.