Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
May 28, 2025
Authors: Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
cs.AI
Abstract
Improving Multi-modal Large Language Models (MLLMs) in the post-training
stage typically relies on supervised fine-tuning (SFT) or reinforcement
learning (RL). However, these supervised methods require expensive and manually
annotated multi-modal data--an ultimately unsustainable resource. While recent
efforts have explored unsupervised post-training, their methods are complex and
difficult to iterate. In this work, we are the first to investigate the use of
GRPO, a stable and scalable online RL algorithm, for enabling continual
self-improvement without any external supervision. We propose MM-UPT, a simple
yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds
upon GRPO, replacing traditional reward signals with a self-rewarding mechanism
based on majority voting over multiple sampled responses. Our experiments
demonstrate that MM-UPT significantly improves the reasoning ability of
Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math),
using standard datasets without ground-truth
labels. MM-UPT also outperforms prior unsupervised baselines and even
approaches the results of supervised GRPO. Furthermore, we show that
incorporating synthetic questions, generated solely by the MLLM itself, can boost
performance as well, highlighting a promising approach for scalable
self-improvement. Overall, MM-UPT offers a new paradigm for continual,
autonomous enhancement of MLLMs in the absence of external supervision. Our
code is available at https://github.com/waltonfuture/MM-UPT.
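Below is a minimal Python sketch of the self-rewarding mechanism described in the abstract, not the authors' implementation: for each question, several responses are sampled, the most frequent extracted answer is treated as a pseudo-label, matching responses receive reward 1, and the rewards are then group-normalized as in GRPO. The `extract_answer` helper is a hypothetical placeholder for whatever answer-parsing rule the model's output format supports.

```python
from collections import Counter
from typing import List


def extract_answer(response: str) -> str:
    """Hypothetical helper: pull the final answer out of a sampled response.
    In practice this would parse a fixed answer tag or a boxed expression."""
    tokens = response.strip().split()
    return tokens[-1] if tokens else ""


def majority_vote_rewards(responses: List[str]) -> List[float]:
    """Self-reward via majority voting: responses whose extracted answer matches
    the most frequent answer in the group get 1.0, all others get 0.0."""
    answers = [extract_answer(r) for r in responses]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages in the GRPO style: normalize rewards
    by the group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


# Example: six sampled responses to one image-question pair (answers only, for brevity).
sampled = [
    "... so the answer is 42",
    "... answer is 41",
    "... answer is 42",
    "... answer is 42",
    "... answer is 7",
    "... answer is 42",
]
rewards = majority_vote_rewards(sampled)
print(rewards)                  # [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # positive for majority answers, negative otherwise
```

These group-normalized advantages would then drive a standard GRPO policy update in place of advantages computed from an external reward model or ground-truth labels.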