Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following

Abstract

Advances in the reasoning abilities of LLMs have significantly improved their performance on mathematical problems, coding tasks, and general puzzles, yet their ability to follow instructions accurately remains inconsistent, particularly for more complex directives. Our investigation identifies lazy reasoning during the thinking stage as the primary factor behind poor instruction adherence. To mitigate this issue, we propose a comprehensive framework that enables rigorous reasoning processes involving preview and self-checking, both essential for satisfying strict instruction constraints. Specifically, we first generate instructions with complex constraints and apply a filtering process to obtain valid prompts, yielding three distinct prompt datasets categorized as hard, easy, and pass. We then apply rejection sampling to the pass prompts to curate a small yet high-quality dataset, which provides a cold-start initialization of the model and facilitates its adaptation to effective reasoning patterns. Subsequently, we employ an entropy-preserving supervised fine-tuning (Entropy-SFT) strategy coupled with token-wise entropy-adaptive reinforcement learning (TEA-RL) guided by rule-based dense rewards. This approach encourages the model to transform its reasoning mechanism, ultimately fostering generalizable reasoning abilities that encompass preview and self-checking. Extensive experiments on instruction-following benchmarks demonstrate remarkable performance improvements across various model scales. Notably, our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models such as Doubao-1.6.
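The abstract does not specify how the rule-based dense rewards or the rejection-sampling filter are implemented. As a rough illustration only, the sketch below assumes that each instruction carries a list of programmatically checkable constraints, that the dense reward is the fraction of constraints a response satisfies, and that rejection sampling keeps only responses satisfying all of them. The names `dense_reward` and `rejection_sample`, the reward shape, and the three example constraints are hypothetical and are not taken from the paper.

```python
from typing import Callable, List

# A constraint is any rule that can be checked on the response text.
Constraint = Callable[[str], bool]


def dense_reward(response: str, constraints: List[Constraint]) -> float:
    """Fraction of constraints satisfied (assumed reward shape, not the paper's)."""
    if not constraints:
        return 1.0
    satisfied = sum(1 for check in constraints if check(response))
    return satisfied / len(constraints)


def rejection_sample(candidates: List[str], constraints: List[Constraint]) -> List[str]:
    """Keep only candidates that satisfy every constraint."""
    return [c for c in candidates if dense_reward(c, constraints) == 1.0]


# Hypothetical constraints for one instruction.
constraints: List[Constraint] = [
    lambda r: len(r.split()) <= 8,      # at most eight words
    lambda r: r.endswith("."),          # must end with a period
    lambda r: "preview" in r.lower(),   # must mention the keyword
]

# Hypothetical sampled responses for that instruction.
candidates = [
    "Preview the constraints first.",
    "I will just answer without checking anything at all because it is faster",
    "Preview, then answer",
]

kept = rejection_sample(candidates, constraints)
for c in candidates:
    print(f"{dense_reward(c, constraints):.2f}  {c}")
```

Under these assumptions, a partially compliant response still receives a nonzero reward, which is what makes the signal dense rather than a binary pass or fail, while the rejection-sampling filter keeps only fully compliant responses.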
