Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

Abstract

We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable: even one-shot RL requires hundreds of GPU hours. This raises a critical question: is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to, or even surpass, the results from RL while using 20x less compute. Ablation studies show that one-shot CFT is robust across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.
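The data-construction step described in the abstract (collect diverse model-generated solutions to one problem, then have teacher LLMs critique each one) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_one_shot_cft_data`, `CritiqueExample`, the prompt template, the `samples_per_generator` parameter, and the exact-duplicate filter are all hypothetical, and the generator and teacher models are represented as plain Python callables rather than real model APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CritiqueExample:
    """One supervised pair: (problem + candidate solution) -> teacher critique."""
    prompt: str
    target: str


def build_one_shot_cft_data(
    problem: str,
    generators: List[Callable[[str], str]],
    teachers: List[Callable[[str, str], str]],
    samples_per_generator: int = 2,
) -> List[CritiqueExample]:
    """Build critique fine-tuning examples from a single problem.

    All names here are hypothetical. `generators` map a problem to a candidate
    solution; `teachers` map (problem, solution) to a critique.
    """
    # Step 1: collect candidate solutions to the single problem
    # from several generator models.
    solutions: List[str] = []
    for generate in generators:
        for _ in range(samples_per_generator):
            solutions.append(generate(problem))

    # Assumption (not from the source): drop exact duplicates, preserving
    # order, so that repeated samples do not crowd out diverse solutions.
    unique_solutions = list(dict.fromkeys(solutions))

    # Step 2: ask every teacher to critique every remaining solution.
    examples: List[CritiqueExample] = []
    for solution in unique_solutions:
        for critique in teachers:
            prompt = (
                f"Problem:\n{problem}\n\n"
                f"Candidate solution:\n{solution}\n\n"
                "Critique this solution."
            )
            examples.append(
                CritiqueExample(prompt=prompt, target=critique(problem, solution))
            )
    return examples
```

In this sketch, the resulting prompt/target pairs would then be used for ordinary supervised fine-tuning of the base model, with the critique as the training target. How many solutions and critiques to collect, and how the critique prompt is worded, are left open here because the abstract does not specify them.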
