Schulung von Schritt-für-Schritt-Argumentationsverifizierern mit formalen Verifikationswerkzeugen

papers.abstract

Prozess-Belohnungsmodelle (PRMs), die schrittweise Rückmeldungen zu den von großen Sprachmodellen (LLMs) generierten Schlussfolgerungen liefern, erhalten zunehmend Aufmerksamkeit. Allerdings bestehen weiterhin zwei zentrale Forschungslücken: Die Sammlung genauer Fehlerlabels auf Schrittebene für das Training erfordert in der Regel kostspielige menschliche Annotation, und bestehende PRMs beschränken sich auf mathematische Schlussfolgerungsprobleme. Als Antwort auf diese Lücken zielt dieser Artikel darauf ab, die Herausforderungen der automatischen Datensatzerstellung und der Generalisierung von PRMs auf vielfältige Schlussfolgerungsaufgaben zu adressieren. Um dieses Ziel zu erreichen, schlagen wir FoVer vor, einen Ansatz zur Schulung von PRMs anhand von Schrittebene-Fehlerlabels, die automatisch durch formale Verifikationstools wie Z3 für formale Logik und Isabelle für Theorembeweise annotiert werden. Diese Tools bieten eine automatische und genaue Verifikation für symbolische Aufgaben. Mit diesem Ansatz synthetisieren wir einen Trainingsdatensatz mit Fehlerlabels für LLM-Antworten zu Aufgaben der formalen Logik und Theorembeweise ohne menschliche Annotation. Obwohl diese Datensynthese nur für Aufgaben möglich ist, die mit formaler Verifikation kompatibel sind, beobachten wir, dass auf unserem Datensatz trainierte LLM-basierte PRMs eine übergreifende Generalisierung zeigen und die Verifikation über verschiedene Schlussfolgerungsaufgaben hinweg verbessern. Insbesondere übertreffen mit FoVer trainierte PRMs die Baseline-PRMs, die auf den ursprünglichen LLMs basieren, deutlich und erzielen wettbewerbsfähige oder überlegene Ergebnisse im Vergleich zu state-of-the-art PRMs, die mit von Menschen oder stärkeren Modellen annotierten Labels trainiert wurden. Dies wird durch die Schrittebene-Verifikation auf ProcessBench und die Best-of-K-Leistung über 12 Schlussfolgerungs-Benchmarks, darunter MATH, AIME, ANLI, MMLU und BBH, gemessen. Die Datensätze, Modelle und der Code sind unter https://github.com/psunlpgroup/FoVer verfügbar.

English

Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proof, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proof tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.

Schulung von Schritt-für-Schritt-Argumentationsverifizierern mit formalen Verifikationswerkzeugen

Training Step-Level Reasoning Verifiers with Formal Verification Tools

papers.abstract

Support