CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning

August 21, 2025
作者: Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang
cs.AI

Abstract

Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate the significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
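The abstract's central mechanism — learn a representation for each CoT, then derive a contrastive signal that relates sampled reasoning paths to the annotated CoT — can be illustrated with a minimal sketch. This is not the paper's implementation (see the linked repository for that); it assumes an InfoNCE-style formulation, a cosine-similarity score, and a temperature hyperparameter, all of which are choices made here for illustration. The CoT embeddings are taken as given fixed-size vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_signal(sampled, annotated, other_samples, temperature=0.1):
    """Illustrative InfoNCE-style loss: treat the annotated CoT representation
    as the positive for a sampled reasoning path, and other sampled paths as
    negatives. A lower value means the sampled path is already close to the
    annotated CoT; the gradient of this loss would pull it closer.
    All arguments are embedding vectors (lists of floats)."""
    pos = math.exp(cosine(sampled, annotated) / temperature)
    negs = sum(math.exp(cosine(sampled, n) / temperature) for n in other_samples)
    return -math.log(pos / (pos + negs))

# A path aligned with the annotated CoT incurs a smaller loss
# than one orthogonal to it.
aligned = contrastive_signal([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
divergent = contrastive_signal([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In the paper's framing, such a signal is unsupervised (it needs only the annotated CoT and the model's own samples), which is how it can both exploit the annotation and stabilize rollout sampling during RL fine-tuning.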
PDF · August 25, 2025