Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but it may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint and selects the most coherent one. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
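The guided-navigation idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that "selecting the most coherent" candidate means keeping the lowest-energy one out of a fixed number of sampled continuations. The names `shaped_jump`, `build_trajectory`, `toy_propose`, `toy_energy`, and the parameter `num_candidates` are hypothetical, and the toy proposal and energy functions merely stand in for the teacher's stochastic jump and the learned energy compass.

```python
import random
from typing import Callable, List, Sequence

Tokens = List[int]
VOCAB_SIZE = 10  # toy vocabulary size (assumption for illustration)


def shaped_jump(
    state: Tokens,
    propose: Callable[[Tokens, random.Random], Tokens],
    energy: Callable[[Tokens], float],
    num_candidates: int,
    rng: random.Random,
) -> Tokens:
    """Sample several candidate continuations and keep the lowest-energy one.

    With num_candidates == 1 this reduces to a single blind stochastic jump.
    """
    candidates = [propose(state, rng) for _ in range(num_candidates)]
    return min(candidates, key=energy)


def build_trajectory(
    source: Sequence[int],
    propose: Callable[[Tokens, random.Random], Tokens],
    energy: Callable[[Tokens], float],
    num_midpoints: int,
    num_candidates: int,
    seed: int = 0,
) -> List[Tokens]:
    """Build a training trajectory by chaining shaped jumps from a source sample."""
    rng = random.Random(seed)
    trajectory = [list(source)]
    for _ in range(num_midpoints):
        nxt = shaped_jump(trajectory[-1], propose, energy, num_candidates, rng)
        trajectory.append(nxt)
    return trajectory


def toy_propose(state: Tokens, rng: random.Random) -> Tokens:
    """Hypothetical stand-in for a teacher jump: resample one random position."""
    nxt = list(state)
    nxt[rng.randrange(len(nxt))] = rng.randrange(VOCAB_SIZE)
    return nxt


def toy_energy(tokens: Tokens) -> float:
    """Hypothetical stand-in for the energy compass: lower when neighbours are close."""
    return float(sum(abs(a - b) for a, b in zip(tokens, tokens[1:])))


if __name__ == "__main__":
    traj = build_trajectory([3, 7, 1, 8], toy_propose, toy_energy,
                            num_midpoints=5, num_candidates=4)
    for step, tokens in enumerate(traj):
        print(step, tokens, toy_energy(tokens))
```

In this sketch the selection happens only while building trajectories for the student to imitate, which matches the abstract's statement that shaping is training-only and leaves inference cost unchanged.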
