Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Abstract

Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive, manually annotated multi-modal data, an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate on. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using standard datasets without ground-truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can further boost performance, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
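The self-rewarding idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that a final answer string has already been extracted from each sampled response, that the reward is binary (1.0 for agreeing with the majority answer, 0.0 otherwise), and that advantages are obtained by normalizing rewards within the sampled group, as is common in GRPO-style training. The function names `majority_vote_rewards` and `group_relative_advantages` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Assign 1.0 to answers that match the most frequent answer, else 0.0.

    Assumption: `answers` holds one extracted final answer per sampled
    response to the same question. No ground-truth label is used.
    """
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (mean 0, roughly unit std).

    `eps` is an assumed stabilizer so a group with identical rewards
    yields zero advantages instead of a division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Five hypothetical sampled answers to one unlabeled question.
answers = ["12", "12", "15", "12", "9"]
rewards = majority_vote_rewards(answers)
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this sketch, responses that agree with the majority receive a positive advantage and the others a negative one. When every sampled response agrees, all advantages are zero, so such a group would contribute no learning signal under these assumptions.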
