

Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

June 2, 2025
作者: Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
cs.AI

Abstract

Existing large language models (LLMs) struggle to follow complex instructions, especially when multiple constraints are present and organized in parallel, chained, and branching structures. One intuitive solution, chain-of-thought (CoT) prompting, is expected to universally improve the capabilities of LLMs. However, we find that vanilla CoT actually hurts performance because its reasoning is superficial, merely paraphrasing the instructions. It fails to unpack the composition of constraints and identify their relationships across hierarchies of types and dimensions. To this end, we propose a systematic method to improve how LLMs handle complex instructions by incentivizing reasoning for test-time compute scaling. First, starting from the decomposition of complex instructions under existing taxonomies, we propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable, rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for stronger CoT enforcement. We also exploit behavior cloning of experts to facilitate a steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method: a 1.5B LLM achieves an 11.74% gain, with performance comparable to an 8B LLM. Code and data are available at https://github.com/yuleiqin/RAIF.
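To make the abstract's second step concrete, a "verifiable, rule-centric reward" can be pictured as a function that scores a model response by the fraction of decomposed constraints it satisfies. The sketch below is a minimal illustration under assumed constraint types (word limit, required keywords, JSON format); the checker names, constraint schema, and scoring scheme are assumptions for illustration, not the actual verifiers released with RAIF.

```python
# Minimal sketch of a rule-centric verifiable reward for instruction following.
# The constraint types and scoring below are illustrative assumptions, not the
# verifiers used in the RAIF codebase.
import json


def check_max_words(response: str, limit: int) -> bool:
    """Rule: the response must contain at most `limit` words."""
    return len(response.split()) <= limit


def check_keywords(response: str, keywords: list[str]) -> bool:
    """Rule: every required keyword must appear in the response."""
    return all(k.lower() in response.lower() for k in keywords)


def check_json_format(response: str) -> bool:
    """Rule: the response must parse as valid JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def rule_centric_reward(response: str, constraints: list[dict]) -> float:
    """Return the fraction of constraints satisfied, as a reward in [0, 1]."""
    checkers = {
        "max_words": lambda r, c: check_max_words(r, c["limit"]),
        "keywords": lambda r, c: check_keywords(r, c["keywords"]),
        "json_format": lambda r, c: check_json_format(r),
    }
    passed = sum(checkers[c["type"]](response, c) for c in constraints)
    return passed / len(constraints) if constraints else 0.0


if __name__ == "__main__":
    constraints = [
        {"type": "json_format"},
        {"type": "max_words", "limit": 50},
        {"type": "keywords", "keywords": ["RAIF"]},
    ]
    response = '{"summary": "RAIF incentivizes reasoning for instruction following."}'
    print(rule_centric_reward(response, constraints))  # 1.0 when all rules pass
```

Because each rule is checked programmatically rather than judged by another model, the reward is verifiable and can drive RL training on complex instructions without a learned reward model.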