GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

Abstract

To enhance the safety of vision-language models (VLMs), this paper introduces GuardReasoner-VL, a novel reasoning-based VLM guard model. The core idea is to incentivize the guard model to reason deliberately before making moderation decisions, using online reinforcement learning (RL). First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps spanning text, image, and text-image inputs. Based on this corpus, we cold-start the model's reasoning ability via supervised fine-tuning (SFT). We then further enhance its moderation reasoning through online RL. Concretely, to increase the diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. We also use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model; notably, it surpasses the runner-up by 19.27% F1 score on average. We release the data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/
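
The abstract names three training components without giving their formulas: safety-aware data concatenation, a dynamic clipping parameter, and a length-aware safety reward. The sketch below is one possible reading of them, not the authors' implementation. Every name in it (`Sample`, `concat_samples`, `clip_epsilon`, `length_aware_reward`), the "harmful if either part is harmful" labeling rule, the linear schedule, and all constants are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """A moderation sample; `harmful` is the ground-truth label (hypothetical type)."""
    text: str
    harmful: bool


def concat_samples(a: Sample, b: Sample) -> Sample:
    """Join two samples into a harder one.

    Assumed labeling rule: the result is harmful if either part is harmful.
    """
    return Sample(text=a.text + "\n" + b.text, harmful=a.harmful or b.harmful)


def clip_epsilon(step: int, total_steps: int,
                 eps_start: float = 0.3, eps_end: float = 0.1) -> float:
    """Shrink the policy-ratio clipping range linearly over training.

    A wide range early permits larger updates (exploration); a narrow
    range late keeps updates conservative (exploitation). The linear
    shape and both endpoints are assumptions.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return eps_start + (eps_end - eps_start) * frac


def length_aware_reward(correct: bool, well_formatted: bool, n_tokens: int,
                        max_tokens: int = 1024,
                        length_weight: float = 0.2) -> float:
    """Combine accuracy, format, and token cost into one scalar (assumed form)."""
    if not well_formatted:
        return -1.0  # flat penalty for output that breaks the expected format
    accuracy = 1.0 if correct else 0.0
    token_cost = min(n_tokens, max_tokens) / max_tokens  # normalized to [0, 1]
    return accuracy - length_weight * token_cost
```

Under these assumptions, a correct and well-formatted answer scores higher the fewer tokens it uses, which is the trade-off between performance and token efficiency that the abstract describes.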
