Heimdall: test-time scaling on the generative verification

Abstract

An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought (CoT) reasoning has demonstrated the great potential of LLMs for solving competitive problems, but their verification ability remains weak and insufficiently investigated. In this paper, we propose Heimdall, a long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. In human evaluation, Heimdall demonstrates impressive generalization capabilities, successfully detecting most issues in challenging math proofs, a problem type not included during training. Furthermore, we propose Pessimistic Verification to extend Heimdall's functionality to scaling up problem solving. It calls Heimdall to judge the solutions from a solver model and, based on the pessimistic principle, selects the most likely correct solution with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget and to 83.3% with more compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system in which one component poses questions, another provides solutions, and the third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.
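The two test-time scaling ideas named in the abstract, repeated sampling of the verifier and pessimistic selection among solver outputs, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `majority_verdict` and `pessimistic_select`, the `penalty` parameter, and in particular the scoring rule (mean verdict minus a penalty that shrinks as more verifications accumulate) are hypothetical and are not taken from the paper, which does not state its formula in this abstract.

```python
import math
from collections import defaultdict


def majority_verdict(verdicts):
    """Aggregate repeated verifier samples by majority vote.

    verdicts: list of 0/1 judgments (1 = solution judged correct).
    Ties resolve to 0, i.e. the solution is treated as not verified.
    """
    return 1 if 2 * sum(verdicts) > len(verdicts) else 0


def pessimistic_select(candidates, penalty=1.0):
    """Pick an answer using a hypothetical pessimistic score.

    candidates: list of (answer, verdicts) pairs, one per solver sample,
    where verdicts is a list of 0/1 verifier judgments for that sample.
    Assumed score: mean verdict minus penalty / sqrt(number of verdicts),
    so answers backed by few verifications are penalized for uncertainty.
    """
    grouped = defaultdict(list)
    for answer, verdicts in candidates:
        grouped[answer].extend(verdicts)

    def score(answer):
        v = grouped[answer]
        return sum(v) / len(v) - penalty / math.sqrt(len(v))

    return max(grouped, key=score)


# Illustrative data: "42" appears in two solver samples, "7" in one.
candidates = [
    ("42", [1, 1, 1, 0]),
    ("42", [1, 1, 0, 1]),
    ("7", [1, 1]),
]
chosen = pessimistic_select(candidates)
```

In this toy input, the answer "7" has a perfect but thinly supported verification record, while "42" has a slightly lower mean backed by four times as many judgments. With the penalty switched off (`penalty=0.0`) the selection reduces to picking the highest mean verdict; with the penalty on, the better-supported answer wins.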
