CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning

Abstract

Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
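
The abstract does not spell out the form of the contrastive signal, so the following is only an illustrative sketch of the general idea of contrasting CoT representations. It uses a generic InfoNCE-style loss, not the exact CARFT objective. The function names (`cosine`, `info_nce`), the toy 2-dimensional embeddings, the temperature value, and the choice of treating a sampled CoT with a correct answer as the positive are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's exact signal).

    anchor:    embedding of the annotated CoT
    positive:  embedding of a sampled CoT that should be pulled toward the anchor
    negatives: embeddings of sampled CoTs that should be pushed away
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Toy embeddings (hypothetical values, chosen only to show the behaviour).
annotated = [1.0, 0.0]
good = [0.9, 0.1]
bad = [[0.0, 1.0], [-1.0, 0.2]]

loss_aligned = info_nce(annotated, good, bad)
loss_misaligned = info_nce(annotated, bad[0], [good, bad[1]])
```

In this toy setup the loss is small when the positive lies close to the annotated CoT and large when a dissimilar CoT is treated as the positive, which is the kind of behaviour a contrastive signal over CoT representations would need in order to guide fine-tuning.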
