ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
February 3, 2025
Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
cs.AI
Abstract
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty.
Our results reveal a significant decline in accuracy as problem complexity grows, a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.