VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Abstract

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with those of fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista, MathVerse, and MathVision to 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.
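
The abstract names the two techniques without giving details, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes GRPO-style advantages are computed by normalizing each reward against the mean and standard deviation of its own rollout group, which makes every advantage zero when all rollouts in a group receive the same reward (one way to read the "vanishing advantages" problem). The `SelectiveReplayBuffer` class, its weighting by advantage magnitude, the `alpha` parameter, and the wording of `RETHINK_TRIGGER` are hypothetical names and choices introduced here for illustration.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (assumed GRPO-style): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SelectiveReplayBuffer:
    """Hypothetical buffer that stores only rollouts with non-zero advantage."""

    def __init__(self, alpha=1.0, seed=0):
        self.items = []          # list of (rollout, advantage) pairs
        self.alpha = alpha       # assumed exponent for magnitude weighting
        self.rng = random.Random(seed)

    def add_group(self, rollouts, advantages, tol=1e-8):
        # Rollouts whose advantage is (numerically) zero carry no gradient
        # signal under this assumption, so they are not stored.
        for rollout, adv in zip(rollouts, advantages):
            if abs(adv) > tol:
                self.items.append((rollout, adv))

    def sample(self, k):
        # Assumption: replay probability grows with |advantage| ** alpha.
        if not self.items:
            return []
        weights = [abs(adv) ** self.alpha for _, adv in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)


# Hypothetical trigger text; the abstract does not give the actual wording.
RETHINK_TRIGGER = " Wait, let me re-check the image and my reasoning."


def force_rethink(rollout_text, trigger=RETHINK_TRIGGER):
    """Append a textual rethinking trigger to the end of an initial rollout."""
    return rollout_text + trigger


if __name__ == "__main__":
    uniform = group_advantages([1, 1, 1, 1])   # every rollout correct
    mixed = group_advantages([1, 0, 0, 0])     # one correct rollout

    buffer = SelectiveReplayBuffer()
    buffer.add_group(["a1", "a2", "a3", "a4"], uniform)  # nothing stored
    buffer.add_group(["b1", "b2", "b3", "b4"], mixed)    # all four stored

    print(len(buffer.items))
    print(force_rethink("The answer is 12."))
```

In this sketch, a group in which every rollout earns the same reward contributes nothing to the buffer, while a mixed group contributes all of its rollouts. Replaying from the buffer is one way to keep training batches populated with informative samples as more groups become uniform.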
