CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
August 21, 2025
Authors: Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang
cs.AI
Abstract
Reasoning capability plays a critical role in the broad
applications of Large Language Models (LLMs). To enhance the reasoning
performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning
approaches have been proposed to address the limited generalization capability
of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their
effectiveness, two major limitations hinder the advancement of LLMs. First,
vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and
incorporate unstable reasoning path sampling, which typically results in model
collapse, an unstable training process, and suboptimal performance. Second,
existing SFT approaches generally overemphasize the annotated CoT, potentially
leading to performance degradation due to insufficient exploitation of
potential CoT. In this paper, we propose a Contrastive learning with annotated
CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the
reasoning performance of LLMs while addressing the aforementioned limitations.
Specifically, we propose learning a representation for each CoT. Based on this
representation, we design novel contrastive signals to guide the fine-tuning
process. Our approach not only fully exploits the available annotated CoT but
also stabilizes the fine-tuning procedure by incorporating an additional
unsupervised learning signal. We conduct comprehensive experiments and in-depth
analysis with three baseline approaches, two foundation models, and two
datasets to demonstrate the significant advantages of CARFT in terms of
robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code
is available at https://github.com/WNQzhu/CARFT.
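
To make the idea concrete, below is a minimal, illustrative sketch of the kind of contrastive signal the abstract describes: each CoT is mapped to a single representation vector, and sampled CoTs are pulled toward the annotated CoT for the same question while being pushed away from the annotated CoTs of other questions in the batch. This is not the authors' released implementation (see the repository above for that); the function names, the mean-pooling encoder, the InfoNCE form of the loss, and the temperature value are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def cot_representation(hidden_states: torch.Tensor,
                       attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token hidden states into one vector per CoT.

    hidden_states:  (batch, seq_len, dim), e.g. the LLM's last-layer states.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    (Mean pooling is an assumption; the paper may use a different encoder.)
    """
    mask = attention_mask.unsqueeze(-1).float()          # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)           # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)              # (batch, 1)
    return summed / counts


def contrastive_signal(sampled: torch.Tensor,
                       annotated: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over CoT representations.

    Each sampled-CoT vector is matched to the annotated-CoT vector of the
    same question (positive, the diagonal) and contrasted against the
    annotated CoTs of other questions in the batch (in-batch negatives).
    """
    sampled = F.normalize(sampled, dim=-1)
    annotated = F.normalize(annotated, dim=-1)
    logits = sampled @ annotated.t() / temperature       # (batch, batch)
    targets = torch.arange(sampled.size(0), device=sampled.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy shapes: 4 questions, 16 tokens, 32-dim hidden states.
    h_sampled = torch.randn(4, 16, 32)
    h_annotated = torch.randn(4, 16, 32)
    mask = torch.ones(4, 16)
    z_s = cot_representation(h_sampled, mask)
    z_a = cot_representation(h_annotated, mask)
    print(contrastive_signal(z_s, z_a).item())
```

In a fine-tuning loop, a term like this would be added alongside the RL objective, e.g. `loss = rl_loss + lam * contrastive_signal(z_s, z_a)` with a hypothetical weighting coefficient `lam`, so that the annotated CoT contributes an additional guidance signal of the stabilizing kind the abstract attributes to CARFT.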