Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
May 27, 2025
Authors: Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
cs.AI
Abstract
Safety reasoning is a recent paradigm where LLMs reason over safety policies
before generating responses, thereby mitigating limitations in existing safety
measures such as over-refusal and jailbreak vulnerabilities. However,
implementing this paradigm is challenging due to the resource-intensive process
of creating high-quality policy-embedded chain-of-thought (CoT) datasets while
ensuring reasoning remains accurate and free from hallucinations or policy
conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation
for Safety Reasoning, a novel data generation recipe that leverages multi-agent
deliberation to iteratively expand reasoning on safety policies. A data refiner
stage in AIDSAFE ensures high-quality outputs by eliminating repetitive,
redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong
foundation for supervised fine-tuning (SFT)-based safety training.
Additionally, to address the need for preference data in alignment stages, such
as DPO training, we introduce a supplemental recipe that uses belief
augmentation to create distinct selected and rejected CoT samples. Our
evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy
adherence and reasoning quality. Consequently, we show that fine-tuning
open-source LLMs on these CoTs can significantly improve safety generalization
and jailbreak robustness while maintaining acceptable utility and over-refusal
accuracy. AIDSAFE-generated CoT datasets can be found here:
https://huggingface.co/datasets/AmazonScience/AIDSAFE
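To make the recipe concrete, the Python sketch below shows how a multi-agent deliberation loop and a refiner stage might be wired together. It is a minimal sketch inferred from the abstract alone, assuming a generic instruction-following LLM; the helper names (llm_call, deliberate, refine), agent prompts, and the agent/round counts are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal AIDSAFE-style sketch: agents iteratively expand policy-grounded
# reasoning, then a refiner drops repetitive, redundant, or deceptive thoughts.
# All prompts and helper names here are assumptions inferred from the abstract.
from dataclasses import dataclass, field

@dataclass
class DeliberationState:
    prompt: str                                          # user query to reason about
    policies: list[str]                                  # safety policies to embed in the CoT
    thoughts: list[str] = field(default_factory=list)    # accumulated CoT steps

def llm_call(instruction: str) -> str:
    """Placeholder for a call to any instruction-following LLM."""
    raise NotImplementedError

def deliberate(state: DeliberationState, n_agents: int = 3, n_rounds: int = 2) -> DeliberationState:
    """Multiple agents take turns expanding or correcting the policy-grounded reasoning."""
    for _ in range(n_rounds):
        for agent_id in range(n_agents):
            new_thought = llm_call(
                f"You are deliberation agent {agent_id}.\n"
                f"Safety policies: {state.policies}\n"
                f"Query: {state.prompt}\n"
                f"Reasoning so far: {state.thoughts}\n"
                "Add or correct one reasoning step grounded in the policies."
            )
            state.thoughts.append(new_thought)
    return state

def refine(state: DeliberationState) -> list[str]:
    """Refiner stage: keep only non-redundant thoughts that are faithful to the policies."""
    verdict = llm_call(
        f"Safety policies: {state.policies}\n"
        f"Candidate thoughts: {state.thoughts}\n"
        "Return a comma-separated list of indices of thoughts to KEEP."
    )
    keep = {int(i) for i in verdict.split(",") if i.strip().isdigit()}
    return [t for i, t in enumerate(state.thoughts) if i in keep]
```

The refined thoughts plus a final response would form one policy-embedded CoT example for SFT; the abstract's belief-augmentation step for DPO-style preference pairs is not shown here.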