Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
May 28, 2025
Authors: Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
cs.AI
Abstract
Improving Multi-modal Large Language Models (MLLMs) in the post-training
stage typically relies on supervised fine-tuning (SFT) or reinforcement
learning (RL). However, these supervised methods require expensive and manually
annotated multi-modal data--an ultimately unsustainable resource. While recent
efforts have explored unsupervised post-training, their methods are complex and
difficult to iterate. In this work, we are the first to investigate the use of
GRPO, a stable and scalable online RL algorithm, for enabling continual
self-improvement without any external supervision. We propose MM-UPT, a simple
yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds
upon GRPO, replacing traditional reward signals with a self-rewarding mechanism
based on majority voting over multiple sampled responses. Our experiments
demonstrate that MM-UPT significantly improves the reasoning ability of
Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math),
using standard datasets without ground-truth
labels. MM-UPT also outperforms prior unsupervised baselines and even
approaches the results of supervised GRPO. Furthermore, we show that
incorporating synthetic questions, generated solely by the MLLM itself, can boost
performance as well, highlighting a promising approach for scalable
self-improvement. Overall, MM-UPT offers a new paradigm for continual,
autonomous enhancement of MLLMs in the absence of external supervision. Our
code is available at https://github.com/waltonfuture/MM-UPT.
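Below is a minimal Python sketch of the self-rewarding mechanism described in the abstract, not the authors' implementation: for each question, several responses are sampled, the most frequent extracted answer is treated as a pseudo-label, matching responses receive reward 1, and the rewards are then group-normalized as in GRPO. The `extract_answer` helper is a hypothetical placeholder for whatever answer-parsing rule the model's output format supports.

```python
from collections import Counter
from typing import List


def extract_answer(response: str) -> str:
    """Hypothetical helper: pull the final answer out of a sampled response.
    In practice this would parse a fixed answer tag or a boxed expression."""
    tokens = response.strip().split()
    return tokens[-1] if tokens else ""


def majority_vote_rewards(responses: List[str]) -> List[float]:
    """Self-reward via majority voting: responses whose extracted answer matches
    the most frequent answer in the group get 1.0, all others get 0.0."""
    answers = [extract_answer(r) for r in responses]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages in the GRPO style: normalize rewards
    by the group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


# Example: six sampled responses to one image-question pair (answers only, for brevity).
sampled = [
    "... so the answer is 42",
    "... answer is 41",
    "... answer is 42",
    "... answer is 42",
    "... answer is 7",
    "... answer is 42",
]
rewards = majority_vote_rewards(sampled)
print(rewards)                  # [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # positive for majority answers, negative otherwise
```

These group-normalized advantages would then drive a standard GRPO policy update in place of advantages computed from an external reward model or ground-truth labels.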