

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

March 23, 2026
作者: Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge, Bowen Li, He Du, Kai Chen, Qipeng Guo
cs.AI

Abstract

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even cause a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the student's distribution as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models so that they alternately generate stylistic and non-stylistic tokens. Consequently, TESSY produces synthetic sequences that inherit the teacher's advanced reasoning capabilities while remaining stylistically consistent with the student's distribution. In code-generation experiments using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%, respectively.
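To make the interleaving idea concrete, here is a minimal sketch of teacher-student cooperative decoding. This is an illustrative toy, not the paper's implementation: the abstract does not specify how stylistic tokens are identified or how the two models are invoked, so the `is_style_token` heuristic and the `teacher_step`/`student_step` callables below are hypothetical stand-ins.

```python
def is_style_token(token: str) -> bool:
    """Toy heuristic: treat discourse/connective words as 'stylistic'.
    (Assumption for illustration; the paper's criterion is not given here.)"""
    STYLE_WORDS = {"okay", "so", "hmm", "wait", "alright"}
    return token.lower().strip(",.") in STYLE_WORDS


def synthesize(prompt_tokens, teacher_step, student_step, max_len=20):
    """Interleave sources per token: the teacher supplies content-bearing
    (non-stylistic) tokens, while stylistic tokens are deferred to the
    student so the synthetic trace stays in the student's style distribution."""
    seq = list(prompt_tokens)
    for _ in range(max_len):
        proposal = teacher_step(seq)  # teacher proposes the next token
        if proposal is None:          # teacher signals end of sequence
            break
        if is_style_token(proposal):
            # Replace the teacher's stylistic token with the student's own.
            proposal = student_step(seq)
        seq.append(proposal)
    return seq


# Usage with deterministic toy "models":
teacher_script = iter(["Okay,", "use", "Euclid's", "algorithm", None])
teacher_step = lambda seq: next(teacher_script)
student_step = lambda seq: "So,"  # the student's preferred connective

out = synthesize(["Q:", "gcd?"], teacher_step, student_step)
# The teacher's stylistic opener "Okay," is swapped for the student's "So,";
# the content tokens "use Euclid's algorithm" pass through unchanged.
```

In a real setting both callables would wrap autoregressive model calls conditioned on the running sequence; the key design point the sketch preserves is that routing is decided per token, not per sequence.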