
Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

June 3, 2025
作者: Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
cs.AI

Abstract

We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable; even one-shot RL requires hundreds of GPU hours. This raises a critical question: is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results match or even surpass those of RL while using 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.
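As a concrete reading of the recipe described in the abstract, the sketch below outlines one-shot CFT data construction in Python. It is a minimal illustration, not the authors' released pipeline: the helpers `sample_solutions` and `critique`, the prompt template, and the JSONL field names are all assumptions. The paper only states that diverse solutions to a single problem are collected and a teacher LLM writes detailed critiques, which then serve as supervised fine-tuning targets.

```python
# Illustrative sketch of one-shot CFT data construction (not the authors' code).
# `sample_solutions` and `critique` are hypothetical stand-ins for calls to a
# student model (e.g. Qwen2.5-Math-7B) and a teacher LLM, respectively.
import json

def sample_solutions(problem: str, n: int = 100) -> list[str]:
    """Sample n diverse candidate solutions to one problem from the student model."""
    raise NotImplementedError("call your student model here")

def critique(problem: str, solution: str) -> str:
    """Ask the teacher LLM for a detailed critique of one candidate solution."""
    raise NotImplementedError("call your teacher model here")

def build_cft_dataset(problem: str, n: int = 100,
                      out_path: str = "cft_one_problem.jsonl") -> None:
    """Write (problem + solution -> critique) pairs as JSONL for standard SFT."""
    with open(out_path, "w") as f:
        for sol in sample_solutions(problem, n):
            example = {
                # Input: the problem plus one candidate solution to be judged.
                "prompt": (f"Problem:\n{problem}\n\nCandidate solution:\n{sol}\n\n"
                           "Critique the solution step by step and state "
                           "whether it is correct."),
                # Target: the teacher's detailed critique.
                "completion": critique(problem, sol),
            }
            f.write(json.dumps(example) + "\n")
```

The resulting JSONL file can be fed to any standard supervised fine-tuning trainer; the key design point is that the model is trained to produce critiques of solutions, not the solutions themselves.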