Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but it can require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint and selects the most coherent one. All shaping happens during training only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
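The selection step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `shape_trajectory`, `propose`, and `energy` are hypothetical, the toy proposal and energy functions are stand-ins for the teacher's stochastic jump and the learned energy compass, and treating "lowest energy" as "most coherent" is an assumption made for illustration. Setting `num_candidates=1` recovers the unguided chain of blind jumps.

```python
import random
from typing import Callable, List, Sequence

Tokens = List[int]


def shape_trajectory(
    start: Sequence[int],
    propose: Callable[[Tokens, int, random.Random], Tokens],
    energy: Callable[[Tokens], float],
    num_midpoints: int,
    num_candidates: int,
    rng: random.Random,
) -> List[Tokens]:
    """Build one trajectory, keeping the lowest-energy candidate at each midpoint.

    With num_candidates=1 this reduces to a chain of unguided stochastic jumps.
    """
    state = list(start)
    trajectory = [state]
    for step in range(num_midpoints):
        # Draw several candidate continuations from the current midpoint.
        candidates = [propose(state, step, rng) for _ in range(num_candidates)]
        # Keep the candidate the energy function scores best (lowest).
        state = min(candidates, key=energy)
        trajectory.append(state)
    return trajectory


# Toy stand-ins (assumptions for illustration, not the paper's components).
VOCAB_SIZE = 5
TARGET = [1, 2, 3, 4, 0, 1, 2, 3]


def toy_propose(state: Tokens, step: int, rng: random.Random) -> Tokens:
    """Resample one random position with a random token."""
    nxt = list(state)
    nxt[rng.randrange(len(nxt))] = rng.randrange(VOCAB_SIZE)
    return nxt


def toy_energy(state: Tokens) -> float:
    """Count mismatches against TARGET; lower means 'more coherent' here."""
    return float(sum(a != b for a, b in zip(state, TARGET)))


if __name__ == "__main__":
    noise_rng = random.Random(0)
    noise = [noise_rng.randrange(VOCAB_SIZE) for _ in TARGET]
    blind = shape_trajectory(noise, toy_propose, toy_energy, 16, 1, random.Random(1))
    guided = shape_trajectory(noise, toy_propose, toy_energy, 16, 4, random.Random(1))
    print("blind final energy: ", toy_energy(blind[-1]))
    print("guided final energy:", toy_energy(guided[-1]))
```

In this sketch the extra candidate evaluations occur only while the trajectory is being built, which mirrors the abstract's point that shaping is a training-time cost; a student distilled from such trajectories would run at inference without the energy function.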
