SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated remarkable general reasoning capabilities. However, systematically evaluating and enhancing these capabilities is challenging because controllable, scalable tools for fine-grained analysis are lacking. Existing benchmarks and datasets often lack the variable control needed for multi-dimensional, systematic analysis and training, or cover only narrow problem types and formats. To address these limitations, we introduce SATQuest, a systematic verifier designed to evaluate and enhance logical reasoning in LLMs by generating diverse, satisfiability-based logical reasoning problems directly from Conjunctive Normal Form (CNF) instances. SATQuest structures these problems along three orthogonal dimensions: instance scale, problem type, and question format. It employs randomized, SAT-based problem generation and objective answer verification via PySAT. This design mitigates memorization issues, allows for nuanced insights into reasoning performance, and enables effective reinforcement fine-tuning. Our extensive evaluation of various LLMs using SATQuest identifies significant limitations in their logical reasoning, particularly in generalizing beyond familiar mathematical formats. Furthermore, we show that reinforcement fine-tuning with SATQuest rewards substantially improves targeted task performance and generalizes to more complex instances, while highlighting remaining challenges in cross-format adaptation. Through these demonstrations, we showcase SATQuest's potential as a foundational tool and a valuable starting point for advancing LLM logical reasoning.
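To make the idea of randomized CNF generation with objective answer verification concrete, the following is a minimal sketch in plain Python. It is not SATQuest's API: the function names (`random_cnf`, `satisfies`, `brute_force_model`, `reward`), the bit-string answer format, and the binary 0/1 reward are all assumptions chosen for illustration, and a brute-force search stands in for the PySAT solver that the abstract mentions. Clauses use the common DIMACS-style convention, where the integer `v` means variable `v` is true and `-v` means it is false.

```python
import itertools
import random
from typing import Dict, List, Optional


def random_cnf(num_vars: int, num_clauses: int, k: int = 3,
               seed: Optional[int] = None) -> List[List[int]]:
    """Draw a random k-CNF formula (hypothetical generator; requires k <= num_vars)."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        # Pick k distinct variables, then negate each with probability 1/2.
        variables = rng.sample(range(1, num_vars + 1), k)
        clauses.append([v if rng.random() < 0.5 else -v for v in variables])
    return clauses


def satisfies(clauses: List[List[int]], assignment: Dict[int, bool]) -> bool:
    """Return True if every clause contains at least one literal made true."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def brute_force_model(clauses: List[List[int]],
                      num_vars: int) -> Optional[Dict[int, bool]]:
    """Exhaustively search for a satisfying assignment (small instances only)."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assignment = {i + 1: bit for i, bit in enumerate(bits)}
        if satisfies(clauses, assignment):
            return assignment
    return None  # the formula is unsatisfiable


def reward(clauses: List[List[int]], num_vars: int, answer: str) -> float:
    """Score a model's answer, assumed to be a bit string such as '110'."""
    if len(answer) != num_vars or set(answer) - {"0", "1"}:
        return 0.0  # malformed answers earn no reward
    assignment = {i + 1: ch == "1" for i, ch in enumerate(answer)}
    return 1.0 if satisfies(clauses, assignment) else 0.0


if __name__ == "__main__":
    formula = random_cnf(num_vars=5, num_clauses=10, seed=0)
    print("formula:", formula)
    print("model:", brute_force_model(formula, 5))
```

Because the check depends only on the formula and the proposed assignment, any satisfying answer is accepted rather than a single stored reference answer, which is one way a verifier of this kind can avoid rewarding memorization. A real setup would presumably replace `brute_force_model` with a SAT solver, since exhaustive search grows exponentially with the number of variables.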
