Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Abstract

Existing large language models (LLMs) struggle to follow complex instructions, especially when multiple constraints are organized in parallel, chained, and branching structures. One intuitive solution, chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that vanilla CoT has a negative impact on performance because of its superficial reasoning pattern of simply paraphrasing the instructions: it fails to peel back the composition of constraints and identify their relationships across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions by incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate a steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method: a 1.5B LLM achieves 11.74% gains, with performance comparable to an 8B LLM. Code and data are available at https://github.com/yuleiqin/RAIF.
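To make the idea of a verifiable, rule-centric reward more concrete, the following is a minimal sketch and not the authors' implementation. The `Constraint` structure, the three example rules, and the choice to score a response by the fraction of satisfied constraints are all assumptions made for illustration; the abstract does not specify how the reward is computed or aggregated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """One programmatically checkable rule (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]


def max_words(limit: int) -> Constraint:
    # Length constraint: the response has at most `limit` whitespace-separated words.
    return Constraint(f"max_words<={limit}", lambda text: len(text.split()) <= limit)


def must_include(keyword: str) -> Constraint:
    # Content constraint: the response mentions `keyword` (case-insensitive).
    return Constraint(f"include:{keyword}", lambda text: keyword.lower() in text.lower())


def ends_with(suffix: str) -> Constraint:
    # Format constraint: the response ends with `suffix`, ignoring trailing whitespace.
    return Constraint(f"ends_with:{suffix}", lambda text: text.rstrip().endswith(suffix))


def rule_reward(response: str, constraints: List[Constraint]) -> float:
    """Return the fraction of constraints satisfied (an assumed aggregation)."""
    if not constraints:
        return 0.0
    passed = sum(1 for c in constraints if c.check(response))
    return passed / len(constraints)


if __name__ == "__main__":
    rules = [max_words(10), must_include("LLM"), ends_with("Thanks.")]
    print(rule_reward("An LLM follows rules. Thanks.", rules))
    print(rule_reward("No keyword here", rules))
```

Because every rule is a deterministic check on the response text, a reward of this kind can be verified without a learned judge, which is the property the abstract emphasizes. Constraints that cannot be checked by a simple rule would need a different verifier.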
