Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
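The contrast between blind jumps and guided navigation can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the sampler, the energy model, or the number of candidates, so `blind_step`, `toy_energy`, `num_candidates`, and `flip_prob` below are all hypothetical stand-ins. The only idea taken from the text is that, at each midpoint, several candidate continuations are drawn and the one the energy function rates as most coherent is kept.

```python
import random


def blind_step(tokens, vocab, rng, flip_prob=0.3):
    """Hypothetical stochastic jump: resample each token with probability flip_prob."""
    return [rng.choice(vocab) if rng.random() < flip_prob else t for t in tokens]


def toy_energy(tokens):
    """Placeholder energy (an assumption): count adjacent repeated tokens.

    Lower is treated as 'more coherent'. A real energy compass would be a
    learned model; this stand-in only makes the selection step concrete.
    """
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)


def shaped_trajectory(start, vocab, num_midpoints, num_candidates, energy, seed=0):
    """Build a trajectory by keeping the lowest-energy candidate at each midpoint.

    With num_candidates=1 there is nothing to choose between, so this
    reduces to a chain of blind stochastic jumps.
    """
    rng = random.Random(seed)
    trajectory = [list(start)]
    for _ in range(num_midpoints):
        candidates = [
            blind_step(trajectory[-1], vocab, rng) for _ in range(num_candidates)
        ]
        trajectory.append(min(candidates, key=energy))
    return trajectory


if __name__ == "__main__":
    vocab = ["a", "b", "c"]
    start = ["a"] * 6  # stand-in for a noise sequence
    for state in shaped_trajectory(start, vocab, num_midpoints=4,
                                   num_candidates=5, energy=toy_energy):
        print(state, toy_energy(state))
```

In this sketch the selection happens only while trajectories are being built, which mirrors the abstract's statement that shaping is training-only: a student distilled from such trajectories would run at inference without the energy function.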
