Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

May 27, 2025
作者: Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
cs.AI

Abstract

Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need for preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE
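
The abstract names two concrete mechanisms: an iterative multi-agent deliberation loop that expands reasoning over safety policies, and a refiner stage that prunes repetitive, redundant, or deceptive thoughts. As a rough illustration only, the Python sketch below shows how such a loop and refiner might be wired together; call_llm, the prompts, and the agent/round counts are hypothetical placeholders, not the authors' released implementation.

# Minimal sketch of agentic iterative deliberation for policy-embedded CoT
# data creation. Everything below (call_llm, prompts, parameters) is a
# hypothetical illustration based on the abstract, not the AIDSAFE code.

from typing import List

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; substitute your own."""
    raise NotImplementedError

def deliberate(query: str, policies: List[str],
               num_agents: int = 3, num_rounds: int = 2) -> List[str]:
    """Agents take turns extending or correcting a shared chain of thought,
    grounding each step in the provided safety policies."""
    thoughts: List[str] = []
    policy_text = "\n".join(policies)
    for _ in range(num_rounds):
        for agent_id in range(num_agents):
            prompt = (
                f"You are deliberation agent {agent_id}.\n"
                f"Safety policies:\n{policy_text}\n"
                f"User query: {query}\n"
                "Deliberation so far:\n" + "\n".join(thoughts) + "\n"
                "Add one new thought that extends or corrects the reasoning, "
                "citing the relevant policy."
            )
            thoughts.append(call_llm(prompt))
    return thoughts

def refine(query: str, thoughts: List[str], policies: List[str]) -> List[str]:
    """Refiner stage: keep only non-repetitive, policy-faithful thoughts."""
    prompt = (
        "Safety policies:\n" + "\n".join(policies) + "\n"
        f"User query: {query}\n"
        "Candidate thoughts:\n" + "\n".join(thoughts) + "\n"
        "Return only the thoughts that are non-repetitive, non-redundant, "
        "and free of deception, one per line."
    )
    return [t for t in call_llm(prompt).splitlines() if t.strip()]

A chosen/rejected CoT pair for DPO could then be derived in the same framework, for example by re-running deliberation with a deliberately flawed belief injected into one agent's prompt, in the spirit of the belief-augmentation recipe mentioned above; that variant is likewise an assumption, not the paper's exact procedure.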
