Diverse Inference and Verification for Advanced Reasoning

Abstract

Reasoning LLMs such as OpenAI o1, o3, and DeepSeek R1 have made significant progress in mathematics and coding, yet still struggle with advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying solutions to mathematics and code problems, and applying rejection sampling to other problems, is simple and effective. We automatically verify the correctness of solutions to IMO problems with Lean and to ARC puzzles with code, and we find that best-of-N sampling effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8% and accuracy on HLE questions from 8% to 37%. It also solves 80% of the ARC puzzles that 948 humans could not solve and 26.5% of the ARC puzzles that o3 high compute does not solve. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and by varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.
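The two test-time strategies named in the abstract, accepting only candidates that pass a verifier (rejection sampling) and picking the highest-scoring of N candidates (best-of-N), can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy candidate generator, and the toy verifier and scorer are all hypothetical stand-ins. In the paper's setting the verifier would be a Lean proof check or code execution, and the generator would be one or more LLMs.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def rejection_sample(
    generate: Callable[[], T],
    verify: Callable[[T], bool],
    max_attempts: int,
) -> Optional[T]:
    """Draw candidates until one passes the verifier, or give up.

    `generate` and `verify` are hypothetical callables supplied by the caller.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None  # no verified candidate within the budget


def best_of_n(
    generate: Callable[[], T],
    score: Callable[[T], float],
    n: int,
) -> T:
    """Draw exactly n candidates and return the one with the highest score."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


if __name__ == "__main__":
    # Toy task (an assumption for illustration): find x with x * x == 49.
    # A fixed list stands in for the samples a model would produce.
    samples = iter([3, 5, 7, 9])
    verified = rejection_sample(
        generate=lambda: next(samples),
        verify=lambda x: x * x == 49,  # exact check, analogous to a proof checker
        max_attempts=4,
    )
    print("verified answer:", verified)

    samples = iter([3, 5, 8, 9])
    best = best_of_n(
        generate=lambda: next(samples),
        score=lambda x: -abs(x * x - 49),  # soft score when no exact check exists
        n=4,
    )
    print("best-of-N answer:", best)
```

The design difference is that rejection sampling needs a trustworthy pass/fail check and may return nothing, whereas best-of-N always returns an answer but is only as good as its scoring function.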
