When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://github.com/OPPO-Mente-Lab/LLM-Self-Judge.
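
The abstract names three steps (a self-consistency prior, a bounded Judge modulation, and a conversion to within-group relative advantages) without giving their formulas. The sketch below is one plausible reading, not the authors' implementation: the vote-share prior, the tanh-bounded multiplier, the `strength` parameter, and all function names are assumptions made for illustration. Only the final step, standardizing scores within a group, follows the usual GRPO-style advantage.

```python
from collections import Counter
import math


def self_consistency_prior(answers):
    """Assumed prior: the share of sampled trajectories that agree with each answer."""
    counts = Counter(answers)
    return [counts[a] / len(answers) for a in answers]


def bounded_modulation(prior, judge_scores, strength=0.5):
    """Assumed modulation: scale each prior by a factor in (1 - strength, 1 + strength).

    tanh squashes raw Judge scores into (-1, 1), so the Judge can shift
    a trajectory's weight but cannot override the self-consistency prior.
    """
    return [p * (1.0 + strength * math.tanh(j)) for p, j in zip(prior, judge_scores)]


def group_relative_advantages(scores, eps=1e-6):
    """Standardize scores within one group (GRPO-style relative advantage)."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var)
    return [(s - mean) / (std + eps) for s in scores]


# Hypothetical group of four trajectories sampled for a single input.
answers = ["12", "12", "15", "12"]      # final answers extracted from each trajectory
judge_scores = [1.0, -0.5, 0.2, 0.0]    # made-up raw Judge scores
prior = self_consistency_prior(answers)
modulated = bounded_modulation(prior, judge_scores)
advantages = group_relative_advantages(modulated)
```

Under these assumptions, the minority answer receives the lowest advantage, and trajectories that share the majority answer are still separated by the Judge's scores. No ground-truth label is used at any point.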
