Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

June 3, 2025
Authors: Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
cs.AI

Abstract

We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable: even one-shot RL requires hundreds of GPU hours. This raises a critical question: is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results match or even surpass those of RL while using 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.
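
The data-construction recipe described in the abstract (sample many solutions to one problem from the student model, have a teacher LLM critique each, then fine-tune on the critiques) can be made concrete with a short sketch. Below is a minimal Python sketch assuming an OpenAI-compatible chat API; the model names `student-model` and `teacher-model`, the prompts, and the JSONL schema are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of one-shot CFT data construction, assuming an
# OpenAI-compatible API serves both a "student" and a "teacher" model.
# Model names, prompts, and the JSONL schema are illustrative only.
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROBLEM = "..."  # the single seed problem (e.g., a competition math problem)


def sample_solutions(problem: str, n: int = 16) -> list[str]:
    """Sample n diverse candidate solutions from the student model."""
    resp = client.chat.completions.create(
        model="student-model",  # hypothetical endpoint name
        messages=[{"role": "user", "content": problem}],
        temperature=1.0,  # high temperature encourages diverse solutions
        n=n,
    )
    return [choice.message.content for choice in resp.choices]


def critique(problem: str, solution: str) -> str:
    """Ask the teacher model for a detailed critique of one solution."""
    prompt = (
        f"Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
        "Critique this solution step by step: point out every error, "
        "justify each judgment, and state whether the final answer is correct."
    )
    resp = client.chat.completions.create(
        model="teacher-model",  # hypothetical endpoint name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic, careful critiques
    )
    return resp.choices[0].message.content


def build_cft_dataset(path: str = "cft_one_problem.jsonl") -> None:
    """Write (problem + solution -> critique) pairs for fine-tuning."""
    with open(path, "w") as f:
        for sol in sample_solutions(PROBLEM):
            record = {
                "input": f"{PROBLEM}\n\nCandidate solution:\n{sol}",
                "target": critique(PROBLEM, sol),
            }
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    build_cft_dataset()
```

The high sampling temperature on the student side is what produces the diverse (often flawed) solutions that make the teacher's critiques informative; the resulting JSONL pairs can then be fed to any standard supervised fine-tuning pipeline.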