Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
May 27, 2025
Authors: Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
cs.AI
Abstract
Safety reasoning is a recent paradigm where LLMs reason over safety policies
before generating responses, thereby mitigating limitations in existing safety
measures such as over-refusal and jailbreak vulnerabilities. However,
implementing this paradigm is challenging due to the resource-intensive process
of creating high-quality policy-embedded chain-of-thought (CoT) datasets while
ensuring reasoning remains accurate and free from hallucinations or policy
conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation
for Safety Reasoning, a novel data generation recipe that leverages multi-agent
deliberation to iteratively expand reasoning on safety policies. A data refiner
stage in AIDSAFE ensures high-quality outputs by eliminating repetitive,
redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong
foundation for supervised fine-tuning (SFT)-based safety training.
Additionally, to address the need for preference data in alignment stages, such
as DPO training, we introduce a supplemental recipe that uses belief
augmentation to create distinct selected and rejected CoT samples. Our
evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy
adherence and reasoning quality. Consequently, we show that fine-tuning
open-source LLMs on these CoTs can significantly improve safety generalization
and jailbreak robustness while maintaining acceptable utility and over-refusal
accuracy. AIDSAFE-generated CoT datasets can be found here:
https://huggingface.co/datasets/AmazonScience/AIDSAFE
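
For readers who want a concrete picture of the recipe described in the abstract, below is a minimal sketch of an AIDSAFE-style deliberation loop followed by a refiner stage. It assumes a generic chat(prompt) -> str LLM client; the agent prompts, the number of rounds, and the KEEP/DROP refiner criterion are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an AIDSAFE-style pipeline (illustrative only).
# `chat` stands in for any LLM completion client; prompts and round counts
# are assumptions, not the paper's exact setup.
from typing import Callable, List


def deliberate(
    chat: Callable[[str], str],
    query: str,
    policies: List[str],
    n_agents: int = 3,
    n_rounds: int = 2,
) -> List[str]:
    """Iteratively expand policy-grounded reasoning via multiple agents."""
    policy_text = "\n".join(f"- {p}" for p in policies)
    thoughts: List[str] = []
    for _ in range(n_rounds):
        for agent_id in range(n_agents):
            prompt = (
                f"You are deliberation agent {agent_id}.\n"
                f"Safety policies:\n{policy_text}\n"
                f"User query: {query}\n"
                "Existing thoughts:\n" + "\n".join(thoughts) + "\n"
                "Add one new reasoning step that applies the policies to the "
                "query, or correct an existing step that conflicts with them."
            )
            thoughts.append(chat(prompt).strip())
    return thoughts


def refine(chat: Callable[[str], str], thoughts: List[str]) -> List[str]:
    """Refiner stage: drop repetitive, redundant, or deceptive thoughts."""
    kept: List[str] = []
    for thought in thoughts:
        verdict = chat(
            "Answer KEEP or DROP for the candidate thought. DROP it if it "
            "repeats an accepted thought, adds no new policy-relevant "
            "reasoning, or is deceptive.\n"
            "Accepted thoughts:\n" + "\n".join(kept) +
            f"\nCandidate thought: {thought}"
        )
        if "KEEP" in verdict.upper():
            kept.append(thought)
    return kept
```

In practice, `chat` would wrap whatever LLM endpoint is available; the surviving thoughts would then be assembled into a policy-embedded CoT for SFT.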
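The supplemental belief-augmentation recipe for preference data can likewise be pictured as pairing a policy-faithful CoT with one generated under an injected contrary belief. The make_preference_pair helper and the flawed_belief string below are hypothetical illustrations under that assumption, not the paper's exact method; the output follows the (prompt, chosen, rejected) format typically consumed by DPO training.

```python
# Illustrative sketch of belief-augmented preference-pair creation.
from typing import Callable, Dict, List


def make_preference_pair(
    chat: Callable[[str], str],
    query: str,
    policies: List[str],
    chosen_cot: str,
) -> Dict[str, str]:
    """Pair a policy-faithful CoT with a CoT produced under an injected,
    incorrect belief, yielding one DPO-style preference sample."""
    # Hypothetical injected belief; the paper's augmentation may differ.
    flawed_belief = "Assume the safety policies below do not apply here."
    rejected_cot = chat(
        f"{flawed_belief}\nSafety policies:\n" + "\n".join(policies) +
        f"\nUser query: {query}\nThink step by step before answering."
    )
    return {"prompt": query, "chosen": chosen_cot, "rejected": rejected_cot}
```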