
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

May 16, 2025
Authors: Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
cs.AI

Abstract

To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to reason deliberatively before making moderation decisions, via online reinforcement learning (RL). First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance its reasoning for moderation through online RL. Concretely, to increase the diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release the data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/
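The two RL components named above can be sketched as follows. This is a minimal illustration of the general ideas, not the paper's actual implementation: the function names, reward weights, and annealing schedule are all assumptions for the sake of the example.

```python
# Hypothetical sketch of (1) a length-aware safety reward combining
# accuracy, format, and token cost, and (2) a dynamic clipping schedule
# that widens early (exploration) and narrows later (exploitation).
# All names and weights are illustrative assumptions.

def length_aware_safety_reward(pred_label, true_label, well_formatted,
                               num_tokens, token_budget=256,
                               w_acc=1.0, w_fmt=0.1, w_len=0.05):
    """Combine moderation accuracy, output format, and token cost
    into a single scalar reward."""
    r_acc = w_acc if pred_label == true_label else -w_acc
    r_fmt = w_fmt if well_formatted else -w_fmt
    # Penalize only the tokens that exceed the budget, so concise
    # correct reasoning is not punished.
    r_len = -w_len * max(0, num_tokens - token_budget) / token_budget
    return r_acc + r_fmt + r_len


def dynamic_clip(step, total_steps, eps_start=0.3, eps_end=0.1):
    """Linearly anneal a PPO-style clipping range from a wide value
    (more exploration) to a narrow one (more exploitation)."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

For example, a correct, well-formatted verdict within the token budget would score `w_acc + w_fmt`, while an over-long response is docked in proportion to its overshoot; the clipping range would shrink from `eps_start` toward `eps_end` over training.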
