ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
February 3, 2025
Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
cs.AI
Abstract
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty.
Our results reveal a significant decline in accuracy as problem complexity grows, a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.