Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

Abstract

Strong LLMs such as Qwen-Math, MiMo, and Phi-4 have been observed to possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable: even one-shot RL requires hundreds of GPU hours. This raises a critical question: is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to, or even surpass, those obtained with RL while using 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.
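The data-construction step described in the abstract (one problem, many candidate solutions, one teacher critique per solution) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `CritiqueExample` record, the `build_cft_dataset` helper, the prompt template, and the `toy_teacher` function are all hypothetical names introduced here for illustration, and the toy teacher merely stands in for a real teacher LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CritiqueExample:
    """One supervised example: the model reads `prompt` and learns to emit `target`."""
    prompt: str
    target: str


def build_cft_dataset(
    problem: str,
    candidate_solutions: List[str],
    critique_fn: Callable[[str, str], str],
) -> List[CritiqueExample]:
    """Pair every candidate solution to the single problem with a teacher critique."""
    examples = []
    for solution in candidate_solutions:
        # In practice this would be a call to a teacher LLM (assumption).
        critique = critique_fn(problem, solution)
        # Hypothetical prompt template; the abstract does not specify one.
        prompt = (
            f"Problem:\n{problem}\n\n"
            f"Candidate solution:\n{solution}\n\n"
            "Critique the solution and state whether it is correct."
        )
        examples.append(CritiqueExample(prompt=prompt, target=critique))
    return examples


def toy_teacher(problem: str, solution: str) -> str:
    """Hypothetical stand-in for a teacher LLM: checks only the final character."""
    verdict = "correct" if solution.strip().endswith("4") else "incorrect"
    return f"The final answer is {verdict}."


# One problem, several (here: two) model-generated solutions.
dataset = build_cft_dataset(
    problem="What is 2 + 2?",
    candidate_solutions=["2 + 2 = 4", "2 + 2 = 5"],
    critique_fn=toy_teacher,
)
```

Each resulting record could then be fed to an ordinary supervised fine-tuning loop, with the critique as the training target rather than the solution itself.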
