Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

May 27, 2025
作者: Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
cs.AI

Abstract

Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need for preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE
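
The abstract describes two core stages of the recipe: multi-agent deliberation that iteratively expands policy-grounded reasoning, followed by a refiner that filters out repetitive, redundant, or deceptive thoughts. Below is a minimal Python sketch of that loop; the prompts, the single-agent turn structure, and the `call_llm` helper are illustrative assumptions for exposition, not the authors' implementation.

```python
from typing import Callable, List


def call_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM of your choice."""
    raise NotImplementedError("Wire this up to a model endpoint.")


def deliberate(query: str, policies: List[str], rounds: int = 3,
               llm: Callable[[str], str] = call_llm) -> List[str]:
    """Iteratively expand policy-grounded reasoning over several agent turns."""
    thoughts: List[str] = []
    for _ in range(rounds):
        # Each turn, a deliberation agent reads the policies, the query,
        # and the running chain-of-thought, then contributes one new step.
        prompt = (
            "Safety policies:\n" + "\n".join(policies)
            + "\n\nUser query: " + query
            + "\n\nThoughts so far:\n" + "\n".join(thoughts)
            + "\n\nAdd one new reasoning step that applies the policies."
        )
        thoughts.append(llm(prompt))
    return thoughts


def refine(thoughts: List[str],
           llm: Callable[[str], str] = call_llm) -> List[str]:
    """Refiner stage: keep only thoughts judged non-redundant and honest."""
    kept: List[str] = []
    for thought in thoughts:
        verdict = llm(
            "Given the accepted thoughts below, answer KEEP if the candidate "
            "adds a new, honest reasoning step, else DROP.\n\n"
            "Accepted:\n" + "\n".join(kept) + "\n\nCandidate: " + thought
        )
        if verdict.strip().upper().startswith("KEEP"):
            kept.append(thought)
    return kept
```

Keeping deliberation and refinement as separate passes mirrors the abstract's description: reasoning is expanded first, then quality-controlled before the CoTs are used for SFT.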

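The released CoT data can be pulled directly from the Hugging Face Hub. The snippet below assumes the standard `datasets` library; the abstract does not specify split or column names, so consult the dataset card for the actual schema.

```python
# Load the AIDSAFE policy-embedded CoT dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("AmazonScience/AIDSAFE")
print(ds)  # prints the available splits and their columns
```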