When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
March 22, 2026
Authors: Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang
cs.AI
Abstract
Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://github.com/OPPO-Mente-Lab/LLM-Self-Judge.
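The core scoring pipeline the abstract describes (self-consistency prior, bounded judge modulation, group-relative advantages) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `group_relative_advantages`, the linear blending weight `beta`, and the majority-vote form of the self-consistency prior are all assumptions for the sake of the example.

```python
from collections import Counter
import math

def group_relative_advantages(answers, judge_scores, beta=0.5):
    """Sketch of the abstract's scoring pipeline (assumed formulation):
    1) self-consistency prior from majority agreement among sampled
       trajectories, 2) modulation by a judge score bounded to [0, 1],
    3) group-level normalization into relative advantages (GRPO-style).
    """
    # Self-consistency prior: 1.0 if a trajectory's final answer
    # matches the group's majority answer, else 0.0.
    majority, _ = Counter(answers).most_common(1)[0]
    prior = [1.0 if a == majority else 0.0 for a in answers]

    # Bounded judge modulation: clip judge scores to [0, 1] and blend
    # with the prior (the blend weight beta is a hypothetical choice).
    scores = [(1.0 - beta) * p + beta * min(max(j, 0.0), 1.0)
              for p, j in zip(prior, judge_scores)]

    # Convert absolute scores into group-relative advantages by
    # standardizing within the group of sampled trajectories.
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) + 1e-8  # avoid division by zero
    return [(s - mean) / std for s in scores]
```

Because advantages are standardized within each group, they sum to roughly zero, so trajectories are rewarded or penalized only relative to their siblings rather than against any absolute threshold.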