Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
May 27, 2025
Authors: Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
cs.AI
Abstract
Safety reasoning is a recent paradigm where LLMs reason over safety policies
before generating responses, thereby mitigating limitations in existing safety
measures such as over-refusal and jailbreak vulnerabilities. However,
implementing this paradigm is challenging due to the resource-intensive process
of creating high-quality policy-embedded chain-of-thought (CoT) datasets while
ensuring reasoning remains accurate and free from hallucinations or policy
conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation
for Safety Reasoning, a novel data generation recipe that leverages multi-agent
deliberation to iteratively expand reasoning on safety policies. A data refiner
stage in AIDSAFE ensures high-quality outputs by eliminating repetitive,
redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong
foundation for supervised fine-tuning (SFT)-based safety training.
Additionally, to address the need for preference data in alignment stages, such
as DPO training, we introduce a supplemental recipe that uses belief
augmentation to create distinct selected and rejected CoT samples. Our
evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy
adherence and reasoning quality. Consequently, we show that fine-tuning
open-source LLMs on these CoTs can significantly improve safety generalization
and jailbreak robustness while maintaining acceptable utility and over-refusal
accuracy. AIDSAFE-generated CoT datasets can be found here:
https://huggingface.co/datasets/AmazonScience/AIDSAFE
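To make the recipe concrete, the Python sketch below shows how a multi-agent deliberation loop and a refiner stage might be wired together. It is a minimal sketch inferred from the abstract alone, assuming a generic instruction-following LLM; the helper names (llm_call, deliberate, refine), agent prompts, and the agent/round counts are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal AIDSAFE-style sketch: agents iteratively expand policy-grounded
# reasoning, then a refiner drops repetitive, redundant, or deceptive thoughts.
# All prompts and helper names here are assumptions inferred from the abstract.
from dataclasses import dataclass, field

@dataclass
class DeliberationState:
    prompt: str                                          # user query to reason about
    policies: list[str]                                  # safety policies to embed in the CoT
    thoughts: list[str] = field(default_factory=list)    # accumulated CoT steps

def llm_call(instruction: str) -> str:
    """Placeholder for a call to any instruction-following LLM."""
    raise NotImplementedError

def deliberate(state: DeliberationState, n_agents: int = 3, n_rounds: int = 2) -> DeliberationState:
    """Multiple agents take turns expanding or correcting the policy-grounded reasoning."""
    for _ in range(n_rounds):
        for agent_id in range(n_agents):
            new_thought = llm_call(
                f"You are deliberation agent {agent_id}.\n"
                f"Safety policies: {state.policies}\n"
                f"Query: {state.prompt}\n"
                f"Reasoning so far: {state.thoughts}\n"
                "Add or correct one reasoning step grounded in the policies."
            )
            state.thoughts.append(new_thought)
    return state

def refine(state: DeliberationState) -> list[str]:
    """Refiner stage: keep only non-redundant thoughts that are faithful to the policies."""
    verdict = llm_call(
        f"Safety policies: {state.policies}\n"
        f"Candidate thoughts: {state.thoughts}\n"
        "Return a comma-separated list of indices of thoughts to KEEP."
    )
    keep = {int(i) for i in verdict.split(",") if i.strip().isdigit()}
    return [t for i, t in enumerate(state.thoughts) if i in keep]
```

The refined thoughts plus a final response would form one policy-embedded CoT example for SFT; the abstract's belief-augmentation step for DPO-style preference pairs is not shown here.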