
SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs

August 31, 2025
Authors: Yanxiao Zhao, Yaqian Li, Zihao Bo, Rinyoichi Takezoe, Haojia Hui, Mo Guang, Lei Ren, Xiaolin Qin, Kaiwen Long
cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated remarkable general reasoning capabilities. However, systematically evaluating and enhancing these reasoning capabilities is challenging due to the lack of controllable and scalable tools for fine-grained analysis. Existing benchmarks and datasets often lack the necessary variable control for multi-dimensional, systematic analysis and training, or have narrow problem types and formats. To address these limitations, we introduce SATQuest, a systematic verifier designed to evaluate and enhance logical reasoning in LLMs by generating diverse, Satisfiability-based logical reasoning problems directly from Conjunctive Normal Form (CNF) instances. SATQuest structures these problems along three orthogonal dimensions: instance scale, problem type, and question format, employing randomized, SAT-based problem generation and objective answer verification via PySAT. This design mitigates memorization issues, allows for nuanced insights into reasoning performance, and enables effective reinforcement fine-tuning. Our extensive evaluation of various LLMs using SATQuest identified significant limitations in their logical reasoning, particularly in generalizing beyond familiar mathematical formats. Furthermore, we show that reinforcement fine-tuning with SATQuest rewards substantially improves targeted task performance and generalizes to more complex instances, while highlighting remaining challenges in cross-format adaptation. Through these demonstrations, we showcase SATQuest's potential as a foundational tool and a valuable starting point for advancing LLM logical reasoning.
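The abstract itself contains no code, but the workflow it describes (randomized CNF-based problem generation with objective answer verification via PySAT) can be illustrated concretely. The sketch below is a minimal, hypothetical rendering of that idea, not code from the SATQuest release: the helper names random_cnf and verify_sat_answer and all parameter choices are illustrative assumptions; only the PySAT calls (CNF, Solver, solve) are real library API.

```python
import random

from pysat.formula import CNF
from pysat.solvers import Solver


def random_cnf(num_vars: int, num_clauses: int, k: int = 3, seed: int = 0) -> CNF:
    """Sample a random k-CNF instance (hypothetical helper).

    num_vars and num_clauses stand in for the 'instance scale'
    dimension the abstract describes.
    """
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        # Pick k distinct variables, then negate each with probability 1/2.
        variables = rng.sample(range(1, num_vars + 1), k)
        clauses.append([v if rng.random() < 0.5 else -v for v in variables])
    return CNF(from_clauses=clauses)


def verify_sat_answer(cnf: CNF, claimed_satisfiable: bool) -> bool:
    """Objective verification via PySAT: compare a model's yes/no answer
    on a satisfiability decision question against solver ground truth."""
    with Solver(bootstrap_with=cnf.clauses) as solver:
        return solver.solve() == claimed_satisfiable


if __name__ == "__main__":
    cnf = random_cnf(num_vars=8, num_clauses=20)
    # In an evaluation or RL fine-tuning loop, the claimed answer would
    # come from the LLM; here it is hard-coded for demonstration.
    print(verify_sat_answer(cnf, claimed_satisfiable=True))
```

Because the checker compares the model's claim against a solver-derived ground truth rather than a stored answer key, this style of verification yields a binary, memorization-resistant signal, which is what makes it usable both for fine-grained evaluation and as a reward in reinforcement fine-tuning.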