Compress-Distill: Reasoning-Trace Compression for Efficient Knowledge Distillation

Abstract

Reasoning models produce long chain-of-thought traces that are costly to distill and that encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, each generate about 283k correct traces; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven truncation ablations on the Qwen teacher, compressed traces reduce training tokens to 12-30% of the raw count, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x, with smaller reductions under the gpt-oss teacher, whose traces are shorter. However, raw traces retain the highest downstream accuracy at every model scale and for both teachers. A length-matched truncation ablation on raw traces shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually match or beat naive truncation, especially for smaller students, while also yielding shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency. At the 0.8B scale under LoRA, compressed traces narrow the gap to raw traces but never exceed them.
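To make the length-matched truncation baseline concrete, the following is a minimal sketch, not the authors' implementation. It assumes that lengths are matched in characters (the abstract reports compression in character length, but the ablation could equally match token counts) and that truncation keeps the leading part of the raw trace. The function names `compression_ratio` and `truncate_to_match` are hypothetical, as are the toy strings.

```python
def compression_ratio(raw: str, compressed: str) -> float:
    """Compressed length as a fraction of the raw length, in characters."""
    if not raw:
        raise ValueError("raw trace must be non-empty")
    return len(compressed) / len(raw)


def truncate_to_match(raw: str, compressed: str) -> str:
    """Naive baseline: cut the raw trace to the compressed trace's length.

    Assumption: the prefix is kept. The source does not say which part
    of the raw trace survives truncation.
    """
    return raw[: len(compressed)]


# Toy data (hypothetical): a 500-character raw trace and a 60-character summary.
raw_trace = "step " * 100
compressed_trace = "x" * 60

ratio = compression_ratio(raw_trace, compressed_trace)
baseline = truncate_to_match(raw_trace, compressed_trace)
print(f"ratio={ratio:.2%}, baseline length={len(baseline)}")
```

Both variants then give the student the same character budget, so any accuracy difference between them reflects what the compressor chose to keep rather than how much text was kept.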