
CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning

August 21, 2025
作者: Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang
cs.AI

Abstract

Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoTs. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate the significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
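The abstract describes the method only at a high level: learn a representation for each CoT and derive contrastive signals from it to guide fine-tuning. Below is a minimal PyTorch sketch of one plausible instantiation of such a signal, assuming mean-pooled hidden states as the CoT representation and an InfoNCE-style objective that treats the annotated CoT as the positive and other sampled reasoning paths as negatives. The function names, pooling choice, and temperature are illustrative assumptions, not the authors' verified implementation; consult the linked repository for the actual code.

```python
# Illustrative sketch of a contrastive signal over CoT representations.
# Assumptions (not from the paper): mean pooling, InfoNCE loss, in-batch
# sampled reasoning paths as negatives, temperature 0.1.
import torch
import torch.nn.functional as F


def cot_representation(hidden_states: torch.Tensor,
                       attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token hidden states (B, T, D) into one vector per CoT (B, D)."""
    mask = attention_mask.unsqueeze(-1).float()                   # (B, T, 1)
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


def contrastive_cot_loss(sampled: torch.Tensor,
                         annotated: torch.Tensor,
                         negatives: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss pulling each sampled CoT toward its annotated CoT
    and away from other sampled paths.

    sampled:   (B, D) representations of sampled reasoning paths
    annotated: (B, D) representations of the annotated CoTs (positives)
    negatives: (B, K, D) representations of K other sampled paths
    """
    sampled = F.normalize(sampled, dim=-1)
    annotated = F.normalize(annotated, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (sampled * annotated).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", sampled, negatives)       # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The positive sits at index 0 of each row of logits.
    labels = torch.zeros(sampled.size(0), dtype=torch.long, device=sampled.device)
    return F.cross_entropy(logits, labels)
```

In a training loop, a term like this would be added to the RL fine-tuning objective so that the annotated CoT anchors the distribution of sampled reasoning paths, which is consistent with the abstract's claim that the extra unsupervised signal stabilizes fine-tuning.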