
Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

June 3, 2025
作者: Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
cs.AI

Abstract

We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable; even one-shot RL requires hundreds of GPU hours. This raises a critical question: is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results match or even surpass those of RL while using 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.
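As a concrete reading of the recipe described in the abstract, the sketch below outlines one-shot CFT data construction in Python. It is a minimal illustration, not the authors' released pipeline: the helpers `sample_solutions` and `critique`, the prompt template, and the JSONL field names are all assumptions. The paper only states that diverse solutions to a single problem are collected and a teacher LLM writes detailed critiques, which then serve as supervised fine-tuning targets.

```python
# Illustrative sketch of one-shot CFT data construction (not the authors' code).
# `sample_solutions` and `critique` are hypothetical stand-ins for calls to a
# student model (e.g. Qwen2.5-Math-7B) and a teacher LLM, respectively.
import json

def sample_solutions(problem: str, n: int = 100) -> list[str]:
    """Sample n diverse candidate solutions to one problem from the student model."""
    raise NotImplementedError("call your student model here")

def critique(problem: str, solution: str) -> str:
    """Ask the teacher LLM for a detailed critique of one candidate solution."""
    raise NotImplementedError("call your teacher model here")

def build_cft_dataset(problem: str, n: int = 100,
                      out_path: str = "cft_one_problem.jsonl") -> None:
    """Write (problem + solution -> critique) pairs as JSONL for standard SFT."""
    with open(out_path, "w") as f:
        for sol in sample_solutions(problem, n):
            example = {
                # Input: the problem plus one candidate solution to be judged.
                "prompt": (f"Problem:\n{problem}\n\nCandidate solution:\n{sol}\n\n"
                           "Critique the solution step by step and state "
                           "whether it is correct."),
                # Target: the teacher's detailed critique.
                "completion": critique(problem, sol),
            }
            f.write(json.dumps(example) + "\n")
```

The resulting JSONL file can be fed to any standard supervised fine-tuning trainer; the key design point is that the model is trained to produce critiques of solutions, not the solutions themselves.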