GuardReasoner: Towards Reasoning-based LLM Safeguards
January 30, 2025
Authors: Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
cs.AI
Abstract
As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs that guides the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks across 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% in average F1 score. We release the training data, code, and GuardReasoner models at three scales (1B, 3B, 8B): https://github.com/yueliu1999/GuardReasoner/
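The second training stage optimizes a preference objective over reasoning traces. As a minimal PyTorch sketch, assuming the standard DPO loss of Rafailov et al. (the paper's hard sample variant additionally concentrates training on ambiguous samples near the decision boundary, which is not reproduced here), the preference loss over a chosen versus a rejected reasoning trace could look like:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss; a sketch, not the paper's exact implementation.

    Each argument has shape (batch,) and holds the summed log-probability
    that the trainable policy / frozen reference model assigns to the
    chosen (sound) or rejected (flawed) reasoning trace for a prompt.
    """
    # Implicit rewards: how far the policy has moved from the reference
    # on each trace.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the chosen trace's margin above the rejected trace's margin.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

How the chosen/rejected pairs are mined from hard samples is specified in the paper itself; the sketch above only fixes the loss those pairs would feed into.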