Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Abstract

Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive, manually annotated multi-modal data, an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate on. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using standard datasets without ground-truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
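
The self-rewarding idea in the abstract can be illustrated with a minimal sketch. It assumes each sampled response has already been reduced to a final-answer string, that answers are compared by exact match, and that rewards are normalized within the group in the usual GRPO style (subtract the group mean, divide by the group standard deviation). The function names `majority_vote_rewards` and `group_relative_advantages` are hypothetical and are not taken from the MM-UPT repository; the actual implementation may differ.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Give reward 1.0 to answers that match the majority answer, else 0.0.

    Assumption: `answers` holds already-extracted final answers (strings)
    from several responses sampled for the same question. No ground-truth
    label is used; the most frequent answer acts as a pseudo-label.
    """
    if not answers:
        return []
    # On a tie, Counter.most_common returns the answer seen first.
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Assumption: the standard GRPO-style normalization; `eps` is a
    hypothetical stabilizer for groups where every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question; "12" is the majority answer.
sampled = ["12", "12", "15", "12"]
rewards = majority_vote_rewards(sampled)
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

When all sampled answers agree, every reward equals the group mean and the advantages are zero, so such a question contributes no learning signal in this sketch.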
