Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

Abstract

Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form (input=[query; noisy response], output=critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks across different base models such as Qwen2.5, Qwen2.5-Math, and DeepSeek-Math. We further extend the approach to the MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our Qwen2.5-Math-CFT model, trained on just 50K samples, matches or outperforms competitive models such as AceMath and Qwen2.5-Math-Instruct, both of which use over 2M samples, on most benchmarks. Ablation studies show that CFT is robust to the source of the noisy responses and to the choice of teacher critique model. Through these findings, we argue that critique-based training offers a more effective alternative for advancing the reasoning of language models.
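The difference between the two training formats described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' data pipeline: the `TrainingExample` type, the two helper functions, and the prompt template with its "Question" and "Candidate response" labels are all hypothetical names chosen for this example, and the toy arithmetic sample is hand-written rather than drawn from WebInstruct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    """One fine-tuning pair (hypothetical container, not from the paper)."""
    input_text: str   # text the model conditions on
    target_text: str  # text the model is trained to produce


def make_sft_example(query: str, reference_response: str) -> TrainingExample:
    """SFT format: the model learns to imitate the annotated response."""
    return TrainingExample(input_text=query, target_text=reference_response)


def make_cft_example(query: str, noisy_response: str, critique: str) -> TrainingExample:
    """CFT format: input = [query; noisy response], output = critique.

    The labels and separators below are an assumed template for
    illustration; the abstract does not specify the exact prompt layout.
    """
    input_text = f"Question:\n{query}\n\nCandidate response:\n{noisy_response}"
    return TrainingExample(input_text=input_text, target_text=critique)


if __name__ == "__main__":
    # Hand-written toy sample, not taken from the actual dataset.
    query = "What is 2 + 3?"
    sft = make_sft_example(query, "2 + 3 = 5")
    cft = make_cft_example(
        query,
        noisy_response="2 + 3 = 6",
        critique="The addition is incorrect: 2 + 3 = 5, not 6.",
    )
    print(sft)
    print(cft)
```

In the SFT pair the target is the answer itself, whereas in the CFT pair the possibly flawed answer moves into the input and the target becomes the teacher's critique of it.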
