CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
August 21, 2025
作者: Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang
cs.AI
Abstract
Reasoning capability plays a critical role in the broad
applications of Large Language Models (LLMs). To enhance the reasoning
performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning
approaches have been proposed to address the limited generalization capability
of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their
effectiveness, two major limitations hinder the advancement of LLMs. First,
vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and
incorporate unstable reasoning path sampling, which typically results in model
collapse, unstable training process, and suboptimal performance. Second,
existing SFT approaches generally overemphasize the annotated CoT, potentially
leading to performance degradation due to insufficient exploitation of
potential CoT. In this paper, we propose a Contrastive learning with annotated
CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the
reasoning performance of LLMs while addressing the aforementioned limitations.
Specifically, we propose learning a representation for each CoT. Based on this
representation, we design novel contrastive signals to guide the fine-tuning
process. Our approach not only fully exploits the available annotated CoT but
also stabilizes the fine-tuning procedure by incorporating an additional
unsupervised learning signal. We conduct comprehensive experiments and in-depth
analysis with three baseline approaches, two foundation models, and two
datasets to demonstrate the significant advantages of CARFT in terms of
robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code
is available at https://github.com/WNQzhu/CARFT.
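
To make the core idea concrete, below is a minimal, hypothetical sketch of an InfoNCE-style contrastive signal over CoT representations. This is not the authors' implementation: the encoder (mean-pooling of hidden states), the helper names `cot_representation` and `contrastive_cot_loss`, and the temperature value are all illustrative assumptions.

```python
# Hypothetical sketch of a contrastive signal over CoT representations,
# as described in the abstract. Assumes each CoT has been encoded into a
# fixed-size vector; the annotated CoT serves as the anchor, sampled
# reasoning paths that reach the correct answer act as positives, and the
# remaining paths act as negatives.
import torch
import torch.nn.functional as F

def cot_representation(hidden_states: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-token hidden states (seq_len, dim) into one CoT vector.
    (Illustrative choice; any sequence encoder could be substituted.)"""
    return hidden_states.mean(dim=0)

def contrastive_cot_loss(anchor: torch.Tensor,
                         sampled: torch.Tensor,
                         correct: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over sampled reasoning paths.

    anchor:  (dim,)   representation of the annotated CoT
    sampled: (n, dim) representations of n sampled reasoning paths
    correct: (n,)     bool mask of paths that yield the right answer
    """
    # Cosine similarity of each sampled path to the annotated-CoT anchor.
    sims = F.cosine_similarity(anchor.unsqueeze(0), sampled, dim=-1) / temperature
    log_probs = F.log_softmax(sims, dim=0)
    if correct.any():
        # Pull correct paths toward the anchor; incorrect ones are negatives.
        return -log_probs[correct].mean()
    # No correct path sampled in this batch: contribute no gradient.
    return sims.sum() * 0.0

# Toy usage with random vectors:
anchor = torch.randn(16)
sampled = torch.randn(4, 16)
correct = torch.tensor([True, False, True, False])
loss = contrastive_cot_loss(anchor, sampled, correct)
```

In a full reinforced fine-tuning loop, such a term would presumably be added to the policy-gradient objective so that sampled reasoning paths are regularized toward the annotated CoT; the exact encoder, masking of incorrect paths, and loss weighting are design choices specified in the paper itself.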