Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
January 29, 2025
Authors: Yubo Wang, Xiang Yue, Wenhu Chen
cs.AI
Abstract
Supervised Fine-Tuning (SFT) is commonly used to train language models to
imitate annotated responses for given instructions. In this paper, we challenge
this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models
learn to critique noisy responses rather than simply imitate correct ones.
Inspired by human learning processes that emphasize critical thinking, CFT
encourages deeper analysis and nuanced understanding, traits often overlooked by
standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample
dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in
the form of (input=[query; noisy response], output=critique). CFT on this
dataset yields a consistent 4-10% improvement over SFT on six math benchmarks
with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We
further expand to MetaMath and NuminaMath datasets and observe similar gains
over SFT. Notably, our Qwen2.5-Math-CFT model-trained on just 50K
samples-matches or outperforms competitive models such as AceMath and
Qwen2.5-Math-Instruct on most benchmarks, both of which use over 2M samples.
Ablation studies show that CFT is robust to both the source of the noisy
responses and the choice of teacher critique model. Through these findings, we
argue that critique-based
training offers a more effective alternative to advance the reasoning of
language models.
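The (input=[query; noisy response], output=critique) format described in the abstract can be sketched as a single supervised training example. This is a minimal illustration only: the function name, prompt template, and the worked arithmetic example below are assumptions for demonstration, not the paper's exact data format.

```python
def build_cft_example(query: str, noisy_response: str, critique: str) -> dict:
    """Pack a (query, noisy response) pair and its critique into one
    training example. Under CFT, the model is trained to generate the
    critique, not to imitate a reference solution (as in SFT)."""
    prompt = (
        "Question:\n" + query + "\n\n"
        "Candidate solution:\n" + noisy_response + "\n\n"
        "Critique the solution above: point out any errors and state "
        "whether the final answer is correct."
    )
    return {"input": prompt, "output": critique}


# Hypothetical example: a noisy response with an arithmetic slip,
# paired with a teacher-written critique.
example = build_cft_example(
    query="What is 17 * 24?",
    noisy_response="17 * 24 = 398.",
    critique=(
        "Incorrect. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408, "
        "so the final answer should be 408."
    ),
)
```

In this framing, the critique plays the role that the gold response plays in SFT: the same next-token training objective applies, but the target sequence is an analysis of a flawed answer rather than the answer itself.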