Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
May 27, 2025
Authors: Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
cs.AI
Abstract
Safety reasoning is a recent paradigm where LLMs reason over safety policies
before generating responses, thereby mitigating limitations in existing safety
measures such as over-refusal and jailbreak vulnerabilities. However,
implementing this paradigm is challenging due to the resource-intensive process
of creating high-quality policy-embedded chain-of-thought (CoT) datasets while
ensuring reasoning remains accurate and free from hallucinations or policy
conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation
for Safety Reasoning, a novel data generation recipe that leverages multi-agent
deliberation to iteratively expand reasoning on safety policies. A data refiner
stage in AIDSAFE ensures high-quality outputs by eliminating repetitive,
redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong
foundation for supervised fine-tuning (SFT)-based safety training.
Additionally, to address the need for preference data in alignment stages, such
as DPO training, we introduce a supplemental recipe that uses belief
augmentation to create distinct selected and rejected CoT samples. Our
evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy
adherence and reasoning quality. Consequently, we show that fine-tuning
open-source LLMs on these CoTs can significantly improve safety generalization
and jailbreak robustness while maintaining acceptable utility and over-refusal
accuracy. AIDSAFE-generated CoT datasets can be found here:
https://huggingface.co/datasets/AmazonScience/AIDSAFE
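
For readers who want a concrete picture of the recipe described in the abstract, below is a minimal sketch of an AIDSAFE-style deliberation loop followed by a refiner stage. It assumes a generic chat(prompt) -> str LLM client; the agent prompts, the number of rounds, and the KEEP/DROP refiner criterion are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an AIDSAFE-style pipeline (illustrative only).
# `chat` stands in for any LLM completion client; prompts and round counts
# are assumptions, not the paper's exact setup.
from typing import Callable, List


def deliberate(
    chat: Callable[[str], str],
    query: str,
    policies: List[str],
    n_agents: int = 3,
    n_rounds: int = 2,
) -> List[str]:
    """Iteratively expand policy-grounded reasoning via multiple agents."""
    policy_text = "\n".join(f"- {p}" for p in policies)
    thoughts: List[str] = []
    for _ in range(n_rounds):
        for agent_id in range(n_agents):
            prompt = (
                f"You are deliberation agent {agent_id}.\n"
                f"Safety policies:\n{policy_text}\n"
                f"User query: {query}\n"
                "Existing thoughts:\n" + "\n".join(thoughts) + "\n"
                "Add one new reasoning step that applies the policies to the "
                "query, or correct an existing step that conflicts with them."
            )
            thoughts.append(chat(prompt).strip())
    return thoughts


def refine(chat: Callable[[str], str], thoughts: List[str]) -> List[str]:
    """Refiner stage: drop repetitive, redundant, or deceptive thoughts."""
    kept: List[str] = []
    for thought in thoughts:
        verdict = chat(
            "Answer KEEP or DROP for the candidate thought. DROP it if it "
            "repeats an accepted thought, adds no new policy-relevant "
            "reasoning, or is deceptive.\n"
            "Accepted thoughts:\n" + "\n".join(kept) +
            f"\nCandidate thought: {thought}"
        )
        if "KEEP" in verdict.upper():
            kept.append(thought)
    return kept
```

In practice, `chat` would wrap whatever LLM endpoint is available; the surviving thoughts would then be assembled into a policy-embedded CoT for SFT.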
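The supplemental belief-augmentation recipe for preference data can likewise be pictured as pairing a policy-faithful CoT with one generated under an injected contrary belief. The make_preference_pair helper and the flawed_belief string below are hypothetical illustrations under that assumption, not the paper's exact method; the output follows the (prompt, chosen, rejected) format typically consumed by DPO training.

```python
# Illustrative sketch of belief-augmented preference-pair creation.
from typing import Callable, Dict, List


def make_preference_pair(
    chat: Callable[[str], str],
    query: str,
    policies: List[str],
    chosen_cot: str,
) -> Dict[str, str]:
    """Pair a policy-faithful CoT with a CoT produced under an injected,
    incorrect belief, yielding one DPO-style preference sample."""
    # Hypothetical injected belief; the paper's augmentation may differ.
    flawed_belief = "Assume the safety policies below do not apply here."
    rejected_cot = chat(
        f"{flawed_belief}\nSafety policies:\n" + "\n".join(policies) +
        f"\nUser query: {query}\nThink step by step before answering."
    )
    return {"prompt": query, "chosen": chosen_cot, "rejected": rejected_cot}
```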