

When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

March 22, 2026
作者: Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang
cs.AI

Abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://github.com/OPPO-Mente-Lab/LLM-Self-Judge.
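The pipeline the abstract describes can be sketched in a few lines: compute a self-consistency prior over sampled trajectories, modulate it with a bounded judge score, and normalize within the group to obtain relative advantages. This is a minimal illustration, not the authors' implementation; the judge scores and the exact modulation rule are hypothetical placeholders.

```python
from collections import Counter

def group_relative_advantages(answers, judge_scores, eps=1e-8):
    """Sketch of the group-relative advantage idea.

    answers: final answers of G sampled trajectories for one input.
    judge_scores: bounded scores in [0, 1] from the model judging
        its own trajectories (placeholder for the paper's Judge).
    """
    n = len(answers)
    counts = Counter(answers)
    # Self-consistency prior: fraction of trajectories sharing each answer.
    prior = [counts[a] / n for a in answers]
    # Bounded Judge-based modulation: reweight the prior (illustrative rule).
    scores = [p * j for p, j in zip(prior, judge_scores)]
    # Treat scores as a group-level distribution and convert absolute
    # scores into within-group relative advantages (GRPO-style z-score).
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / (std + eps) for s in scores]
```

With four sampled trajectories where three agree on an answer, the majority trajectories receive positive advantages and the outlier a negative one, so policy updates reinforce self-consistent reasoning without any ground-truth labels.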