Compressed Distillation: Reasoning-Trace Compression for Efficient Knowledge Distillation

Abstract

Reasoning models produce long chain-of-thought traces that are costly to distill and that encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, each generate about 283k correct traces; two instruction-tuned models then compress them to 8.6% to 21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12% to 30% of the raw amount, speed up training by 2.0x to 7.6x, and shorten inference outputs by 3x to 19x, with smaller reductions under the gpt-oss teacher, whose traces are shorter to begin with. However, raw traces retain the highest downstream accuracy at every student scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression does not merely benefit from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while keeping inference outputs shorter. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency. At the 0.8B scale under LoRA, the gap between raw and compressed traces narrows, but compressed traces still do not exceed raw.
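To make the quantities in the abstract concrete, the following is a minimal Python sketch of how they could be computed. It is not the authors' code. The function names are hypothetical, the prefix-keeping truncation rule is an assumption about what "naive truncation" means, and "per-token efficiency" is assumed here to be accuracy divided by mean inference output length; the paper may define these differently.

```python
from __future__ import annotations


def char_compression_ratio(raw: str, compressed: str) -> float:
    """Compressed trace length as a fraction of the raw character length."""
    if not raw:
        raise ValueError("raw trace must be non-empty")
    return len(compressed) / len(raw)


def truncate_to_budget(raw_tokens: list[str], budget: int) -> list[str]:
    """Length-matched baseline (assumed form): keep the first `budget` tokens."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return raw_tokens[:budget]


def accuracy_retention(acc_compressed: float, acc_raw: float) -> float:
    """Fraction of raw-trace accuracy kept by the compressed-trace student."""
    if acc_raw <= 0:
        raise ValueError("raw accuracy must be positive")
    return acc_compressed / acc_raw


def per_token_efficiency(accuracy: float, mean_output_tokens: float) -> float:
    """Assumed definition: accuracy per inference output token."""
    if mean_output_tokens <= 0:
        raise ValueError("mean output length must be positive")
    return accuracy / mean_output_tokens


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not results from the paper.
    raw_trace = "x" * 100
    compressed_trace = "x" * 20
    print(char_compression_ratio(raw_trace, compressed_trace))
    print(truncate_to_budget(list("abcdef"), 3))
```

Under these assumed definitions, a student can lose some accuracy relative to raw traces and still have higher per-token efficiency whenever its outputs shrink by a larger factor than its accuracy does, which is the trade-off the abstract describes.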