
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

May 8, 2026
Authors: Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova
cs.AI

Abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
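The contrast between blind jumps and guided navigation can be illustrated with a toy sketch. This is not the authors' implementation: `blind_jump`, `toy_energy`, and `shaped_trajectory` are hypothetical names, the "energy" is a simple mismatch count against a reference sequence rather than a learned model, and the jump is plain random resampling rather than a flow-matching step. The sketch only shows the selection pattern the abstract describes: at each midpoint, draw several candidate continuations and keep the one with the lowest energy. With one candidate per midpoint it reduces to the unguided chain.

```python
import random


def blind_jump(tokens, vocab, rng, flip_prob=0.3):
    """Hypothetical stand-in for one stochastic midpoint step:
    each token is resampled from the vocabulary with probability flip_prob."""
    return [rng.choice(vocab) if rng.random() < flip_prob else t for t in tokens]


def toy_energy(tokens, target):
    """Hypothetical energy: number of positions that differ from a
    reference sequence. Lower means 'more coherent' in this toy setting."""
    return sum(a != b for a, b in zip(tokens, target))


def shaped_trajectory(start, vocab, energy, num_midpoints, num_candidates, rng):
    """Build a training trajectory. At each midpoint, draw num_candidates
    continuations and keep the one with the lowest energy.
    num_candidates=1 recovers a chain of unguided jumps."""
    trajectory = [list(start)]
    current = list(start)
    for _ in range(num_midpoints):
        candidates = [blind_jump(current, vocab, rng) for _ in range(num_candidates)]
        current = min(candidates, key=energy)
        trajectory.append(current)
    return trajectory


if __name__ == "__main__":
    vocab = list("abcd")
    target = list("abcdabcd")
    rng = random.Random(0)
    start = [rng.choice(vocab) for _ in target]

    def energy(seq):
        return toy_energy(seq, target)

    blind = shaped_trajectory(start, vocab, energy, 8, 1, random.Random(1))
    shaped = shaped_trajectory(start, vocab, energy, 8, 4, random.Random(1))
    print("final energy, blind jumps:", energy(blind[-1]))
    print("final energy, guided:     ", energy(shaped[-1]))

    # Step-count ratio quoted in the abstract: 1,024 teacher steps vs 8 student steps.
    print("step ratio:", 1024 // 8)
```

In this sketch the selection happens only while the trajectory is being built, which mirrors the abstract's point that shaping is a training-time cost: a student trained on the resulting trajectory would still run the same number of steps at inference.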