ChatPaper.ai


Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following

August 5, 2025
Authors: Chenyang Wang, Liang Wen, Shousheng Jia, Xiangzheng Zhang, Liang Xu
cs.AI

Abstract

While advancements in the reasoning abilities of LLMs have significantly enhanced their performance in solving mathematical problems, coding tasks, and general puzzles, their effectiveness in accurately adhering to instructions remains inconsistent, particularly with more complex directives. Our investigation identifies lazy reasoning during the thinking stage as the primary factor contributing to poor instruction adherence. To mitigate this issue, we propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking, essential for satisfying strict instruction constraints. Specifically, we first generate instructions with complex constraints and apply a filtering process to obtain valid prompts, resulting in three distinct prompt datasets categorized as hard, easy, and pass. Then, we employ rejection sampling on the pass prompts to curate a small yet high-quality dataset, enabling a cold-start initialization of the model and facilitating its adaptation to effective reasoning patterns. Subsequently, we employ an entropy-preserving supervised fine-tuning (Entropy-SFT) strategy coupled with token-wise entropy-adaptive reinforcement learning (TEA-RL) guided by rule-based dense rewards. This approach encourages the model to transform its reasoning mechanism, ultimately fostering generalizable reasoning abilities that encompass preview and self-checking. Extensive experiments conducted on instruction-following benchmarks demonstrate remarkable performance improvements across various model scales. Notably, our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.
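The data-curation steps described above (bucketing prompts into hard/easy/pass by constraint-satisfaction rate, then rejection-sampling the pass prompts for a cold-start dataset) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function names, the 0%/100% bucketing thresholds, and the `sample_fn`/`check_fn` interfaces are assumptions for the sketch.

```python
def categorize_prompts(prompts, sample_fn, check_fn, n_samples=8):
    """Bucket prompts by how often sampled responses satisfy the
    prompt's constraints. Thresholds here are illustrative:
    never satisfied -> hard, always satisfied -> easy, else -> pass."""
    buckets = {"hard": [], "easy": [], "pass": []}
    for prompt in prompts:
        responses = [sample_fn(prompt) for _ in range(n_samples)]
        rate = sum(check_fn(prompt, r) for r in responses) / n_samples
        if rate == 0.0:
            buckets["hard"].append(prompt)   # model never satisfies it
        elif rate == 1.0:
            buckets["easy"].append(prompt)   # trivially satisfied
        else:
            buckets["pass"].append(prompt)   # sometimes satisfied
    return buckets


def rejection_sample(pass_prompts, sample_fn, check_fn, n_samples=8):
    """Keep one constraint-satisfying response per pass prompt,
    yielding a small, high-quality cold-start SFT dataset."""
    dataset = []
    for prompt in pass_prompts:
        for _ in range(n_samples):
            response = sample_fn(prompt)
            if check_fn(prompt, response):
                dataset.append((prompt, response))
                break  # first accepted response suffices
    return dataset
```

In practice `sample_fn` would call the base model and `check_fn` would be the rule-based constraint verifier that also supplies the dense rewards during the TEA-RL stage.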