Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Abstract

Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy for enhancing the capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models that mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It leverages only curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that integrates multi-dimensional feedback to evaluate model completions comprehensively, based on the logic of the vision task. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning 7B LVLMs with Vision-R1 achieves consistent performance gains, with improvements of up to 50%, surpassing a state-of-the-art model 10x the size.
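The abstract names two mechanisms without giving details: a criterion-driven reward that combines several feedback dimensions, and a rule refinement strategy that adjusts the criteria during training. As a rough illustration only, the Python sketch below assumes an object-localization task with bounding-box outputs. Every specific here is an assumption rather than the paper's definition: the IoU-based matching, the equal weighting of format, precision, and recall, the linear threshold schedule from 0.5 to 0.75, and the names `iou`, `iou_threshold`, and `criterion_reward`.

```python
# Hypothetical sketch of a rule-based vision reward; not the paper's implementation.
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_threshold(step: int, total_steps: int,
                  start: float = 0.5, end: float = 0.75) -> float:
    """Assumed schedule: tighten the matching criterion linearly over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def criterion_reward(pred: list[Box], gt: list[Box],
                     well_formed: bool, threshold: float) -> float:
    """Average of three assumed criteria: output format, precision, recall."""
    if not well_formed:
        return 0.0  # unparseable output earns nothing
    unmatched = list(gt)
    hits = 0
    for p in pred:
        # Greedily match each prediction to the best remaining ground-truth box.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= threshold:
            hits += 1
            unmatched.remove(best)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gt) if gt else 0.0
    return (1.0 + precision + recall) / 3.0


if __name__ == "__main__":
    gt = [(0.0, 0.0, 10.0, 10.0)]
    loose = [(0.0, 0.0, 10.0, 6.0)]  # IoU with gt is 0.6
    for step in (0, 100):
        t = iou_threshold(step, total_steps=100)
        print(step, t, criterion_reward(loose, gt, True, t))
```

Under these assumptions, the same loosely localized box earns the full reward at the start of training, when the threshold is 0.5, and only the format credit at the end, when the threshold is 0.75. That is one way a moving criterion could keep a model from settling on outputs that are merely good enough.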
