ChatPaper.ai


SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs

August 31, 2025
作者: Yanxiao Zhao, Yaqian Li, Zihao Bo, Rinyoichi Takezoe, Haojia Hui, Mo Guang, Lei Ren, Xiaolin Qin, Kaiwen Long
cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated remarkable general reasoning capabilities. However, systematically evaluating and enhancing these reasoning capabilities is challenging due to the lack of controllable and scalable tools for fine-grained analysis. Existing benchmarks and datasets often lack the necessary variable control for multi-dimensional, systematic analysis and training, or have narrow problem types and formats. To address these limitations, we introduce SATQuest, a systematic verifier designed to evaluate and enhance logical reasoning in LLMs by generating diverse, Satisfiability-based logical reasoning problems directly from Conjunctive Normal Form (CNF) instances. SATQuest structures these problems along three orthogonal dimensions: instance scale, problem type, and question format, employing randomized, SAT-based problem generation and objective answer verification via PySAT. This design mitigates memorization issues, allows for nuanced insights into reasoning performance, and enables effective reinforcement fine-tuning. Our extensive evaluation of various LLMs using SATQuest identified significant limitations in their logical reasoning, particularly in generalizing beyond familiar mathematical formats. Furthermore, we show that reinforcement fine-tuning with SATQuest rewards substantially improves targeted task performance and generalizes to more complex instances, while highlighting remaining challenges in cross-format adaptation. Through these demonstrations, we showcase SATQuest's potential as a foundational tool and a valuable starting point for advancing LLM logical reasoning.
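To make the verification idea concrete, here is a minimal sketch of randomized CNF problem generation with objective answer checking. The paper performs verification via PySAT; this sketch swaps in a brute-force satisfiability check so it runs with only the standard library, and the function names (`random_3cnf`, `satisfies`, `brute_force_sat`) are hypothetical, not from the SATQuest codebase. Clauses use the standard DIMACS-style convention of signed integers (e.g. `-3` means "variable 3 is false").

```python
import itertools
import random

def random_3cnf(num_vars, num_clauses, seed=0):
    """Generate a random 3-CNF instance: each clause picks 3 distinct
    variables, each negated with probability 0.5. (Hypothetical helper,
    illustrating randomized SAT-based problem generation.)"""
    rng = random.Random(seed)
    return [
        [v if rng.random() < 0.5 else -v
         for v in rng.sample(range(1, num_vars + 1), 3)]
        for _ in range(num_clauses)
    ]

def satisfies(clauses, assignment):
    """Objective answer verification: a CNF is satisfied iff every
    clause contains at least one literal made true by the assignment
    (a dict mapping variable -> bool)."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def brute_force_sat(clauses, num_vars):
    """Exhaustive search over all 2^num_vars assignments; PySAT's
    solvers would replace this for realistic instance scales."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if satisfies(clauses, assignment):
            return assignment
    return None  # unsatisfiable

cnf = random_3cnf(num_vars=6, num_clauses=10, seed=42)
model = brute_force_sat(cnf, 6)
```

Because any candidate answer from an LLM can be checked against the clauses in this way, correctness is decided mechanically rather than by string matching, which is what makes the generated problems usable both as a benchmark and as a reward signal for reinforcement fine-tuning.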