When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://github.com/OPPO-Mente-Lab/LLM-Self-Judge.
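The scoring pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the agreement-fraction prior, the multiplicative modulation, the `strength` parameter, and all function names here are assumptions chosen for illustration. Only the last step, normalizing scores by the group mean and standard deviation, follows the standard GRPO advantage computation.

```python
from collections import Counter
from statistics import mean, pstdev


def self_consistency_prior(answers):
    """Assumed prior: fraction of the group sharing each trajectory's final answer."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]


def modulate(prior, judge_scores, strength=0.5):
    """Assumed bounded modulation: a judge score in [0, 1] rescales the prior
    by a factor confined to [1 - strength, 1 + strength]."""
    modulated = []
    for p, j in zip(prior, judge_scores):
        j = min(max(j, 0.0), 1.0)  # clip so the judge cannot dominate the prior
        modulated.append(p * (1.0 + strength * (2.0 * j - 1.0)))
    return modulated


def group_relative_advantages(scores, eps=1e-6):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical group of four sampled trajectories for one input.
answers = ["12", "12", "15", "12"]      # final answers extracted from each trajectory
judge_scores = [0.9, 0.4, 0.2, 0.8]     # made-up Judge ratings of reasoning quality

prior = self_consistency_prior(answers)
scores = modulate(prior, judge_scores)
advantages = group_relative_advantages(scores)
```

In this sketch the three trajectories that agree on "12" share the same prior, and the Judge only separates them within a bounded range, so a single confident Judge rating cannot overturn the consensus signal. The resulting advantages are centered within the group, which is what makes the update depend on relative rather than absolute scores.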
