

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

April 29, 2026
Authors: Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan
cs.AI

Abstract

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
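The abstract's first component, TIDAL, modulates distillation strength jointly over training progress and diffusion timestep to account for the teacher being less reliable under heavy masking. The paper's exact schedule is not given here, so the sketch below uses an assumed exponential reliability falloff and a linear training-progress ramp (the function names, the hyperparameter `k`, and the schedule shapes are all hypothetical), applied as a per-token weight on a KL distillation term:

```python
import math

def tidal_weight(progress, timestep, k=5.0):
    """Hypothetical TIDAL-style schedule (not the paper's actual formula).

    Down-weights distillation where the teacher is least reliable:
    high diffusion timestep (heavy masking) and early training.

    progress: training progress in [0, 1]
    timestep: diffusion timestep in [0, 1], with 1.0 = fully masked
    k: assumed sharpness of the reliability falloff
    """
    noise_term = math.exp(-k * timestep)  # teacher reliability decays with masking
    ramp = progress                       # distillation ramps up over training
    return ramp * noise_term

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distill_loss(teacher_probs, student_probs, progress, timestep):
    """Per-token distillation loss scaled by the schedule weight."""
    return tidal_weight(progress, timestep) * kl_divergence(teacher_probs, student_probs)

# Identical teacher/student distributions give zero loss at any weight.
p = [0.7, 0.2, 0.1]
loss = distill_loss(p, p, progress=0.5, timestep=0.3)
```

Under this toy schedule, a token at a near-clean timestep early in training still receives some supervision, while a fully masked token contributes almost nothing until the weight is explicitly raised; the real TIDAL mechanism presumably tunes this trade-off empirically.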
PDF · May 1, 2026