URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Abstract

Chain-of-thought (CoT) reasoning is widely used for mathematical reasoning in Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussion about improving test-time scaling and thereby the potential of these models. In multimodal mathematical reasoning, however, the scarcity of high-quality CoT training data has kept existing models from achieving high-precision CoT reasoning and has limited how much of their reasoning potential can be realized at test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. The result is MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset for multimodal mathematics. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates a process annotation dataset, DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier and effectively improves the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verification capabilities, showcasing its generalization. Model weights, training data, and code will be open-sourced.
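The verifier role described in the abstract can be illustrated with a best-of-N selection sketch: sample several CoT trajectories, score each step with a process reward model, and keep the trajectory with the highest aggregate score. This is not the URSA implementation. The function names, the toy scorer, and the choice of the minimum step score as the aggregate are all assumptions made for illustration; the abstract does not say how URSA-RM-7B scores are combined.

```python
from typing import Callable, Dict, List, Sequence

# A trajectory is a list of reasoning steps (strings).
Trajectory = List[str]


def aggregate_step_scores(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one trajectory score.

    Assumption: use the minimum, so a single bad step sinks the trajectory.
    Other aggregations (product, last-step score) are equally plausible.
    """
    if not step_scores:
        raise ValueError("a trajectory needs at least one step score")
    return min(step_scores)


def best_of_n(
    candidates: List[Trajectory],
    score_steps: Callable[[Trajectory], Sequence[float]],
) -> Trajectory:
    """Return the candidate whose aggregated verifier score is highest."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(
        candidates,
        key=lambda steps: aggregate_step_scores(score_steps(steps)),
    )


# Hypothetical stand-in for a process reward model: a lookup table of
# hand-assigned step scores, used only to make the sketch runnable.
TOY_STEP_SCORES: Dict[str, float] = {
    "read x = 3 from the chart": 0.9,
    "compute 3 * 2 = 6": 0.8,
    "compute 3 * 2 = 5": 0.1,
}


def toy_score_steps(steps: Trajectory) -> List[float]:
    """Score each step via the lookup table; unknown steps get 0.5."""
    return [TOY_STEP_SCORES.get(step, 0.5) for step in steps]


if __name__ == "__main__":
    sampled = [
        ["read x = 3 from the chart", "compute 3 * 2 = 5"],
        ["read x = 3 from the chart", "compute 3 * 2 = 6"],
    ]
    print(best_of_n(sampled, toy_score_steps))
```

In a real setting, `toy_score_steps` would be replaced by calls to a trained verifier, and `sampled` by trajectories drawn from the policy model for the same problem.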
