

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

March 23, 2026
Authors: Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge, Bowen Li, He Du, Kai Chen, Qipeng Guo
cs.AI

Abstract

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify significant stylistic divergence between teacher-generated data and the student's distribution as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the student's distribution. In experiments on code generation with GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.
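The interleaved decoding the abstract describes can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the toy style classifier, the model interfaces (`teacher_next`, `student_next`), and the routing rule (teacher proposes each token; style tokens are replaced by the student's proposal) are all assumptions made for the example.

```python
# Hypothetical sketch of TESSY-style teacher-student interleaved generation:
# reasoning (non-style) tokens come from the teacher, while surface-style
# tokens are taken from the student so the sequence matches the student's
# stylistic distribution. All names and heuristics here are illustrative.

STYLE_TOKENS = {"Hmm,", "Okay,", "Wait,", "So,"}  # toy style vocabulary

def is_style_token(token: str) -> bool:
    """Toy heuristic: discourse markers count as style tokens."""
    return token in STYLE_TOKENS

def interleaved_generate(teacher_next, student_next, prompt, max_tokens=32):
    """Build one synthetic sequence by routing tokens between two models.

    teacher_next / student_next: callables mapping (prompt, prefix) to the
    next token, or None to stop. The teacher proposes each token; when the
    proposal is a style token, the student's proposal is used instead.
    """
    tokens = []
    for _ in range(max_tokens):
        proposal = teacher_next(prompt, tokens)
        if proposal is None:
            break
        if is_style_token(proposal):
            proposal = student_next(prompt, tokens) or proposal
        tokens.append(proposal)
    return tokens

# Toy stand-ins for the two models.
def toy_teacher(prompt, prefix):
    script = ["Hmm,", "compute", "2+2", "=", "4"]
    return script[len(prefix)] if len(prefix) < len(script) else None

def toy_student(prompt, prefix):
    return "Okay,"  # the student's preferred style marker

print(interleaved_generate(toy_teacher, toy_student, "What is 2+2?"))
# → ['Okay,', 'compute', '2+2', '=', '4']
```

The point of the routing rule is that the teacher's reasoning content ("compute 2+2 = 4") survives unchanged, while the style token it proposed ("Hmm,") is swapped for one drawn from the student, keeping the SFT data stylistically in-distribution for the student.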