Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
April 29, 2026
Authors: Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan
cs.AI
Abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters to reach competitive performance. Existing distillation methods for dLLMs reduce inference steps within a single architecture, but none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling an 8B dense teacher and a 16B MoE teacher into a 0.6B student via two heterogeneous pipelines yields a model that outperforms the baseline by an average of 1.53 points across eight benchmarks, with notable gains in code generation: the HumanEval score reaches 48.78, versus 32.3 for the autoregressive (AR) baseline.
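To make the first component concrete, below is a minimal sketch of a TIDAL-style loss in PyTorch. Everything here is an assumption for illustration: the function names (`tidal_weight`, `distill_loss`), the linear schedule, and the per-example timestep in [0, 1] are not specified by the abstract, which states only that distillation strength is jointly modulated over training progress and diffusion timestep to track the teacher's noise-dependent reliability.

```python
import torch
import torch.nn.functional as F

def tidal_weight(train_progress: float, timestep: torch.Tensor) -> torch.Tensor:
    # Hypothetical schedule: trust the teacher less at high-noise timesteps
    # (timestep near 1) and anneal distillation as training progresses.
    # The paper's actual modulation function may differ.
    return (1.0 - train_progress) * (1.0 - timestep)

def distill_loss(student_logits, teacher_logits, targets,
                 train_progress: float, timestep: torch.Tensor):
    # student_logits, teacher_logits: (B, L, V); targets: (B, L); timestep: (B,)
    # Ground-truth cross-entropy on masked tokens.
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets,
                         reduction="none")                       # (B, L)
    # Per-token KL from teacher to student (token-level KD shown for brevity).
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="none").sum(-1)                      # (B, L)
    w = tidal_weight(train_progress, timestep).unsqueeze(-1)     # (B, 1)
    return (ce + w * kl).mean()
```

The key property this sketch captures is that the teacher's signal is down-weighted exactly where it is least reliable (heavily noised inputs) and as the student matures.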
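The complementary mask splitting behind CompDemo can be sketched similarly. This assumes access to the clean sequence during distillation (standard in masked-diffusion training); the helper name `complementary_views` and the 50/50 split are illustrative assumptions, not the paper's specification.

```python
import torch

def complementary_views(clean_ids: torch.Tensor, mask_positions: torch.Tensor,
                        mask_id: int):
    """Split a heavy mask into two complementary halves (CompDemo-style sketch).

    Each view keeps only half of the positions masked and leaves the other
    half visible, so every teacher pass conditions on strictly more context
    than the original heavily masked input. Teacher predictions for half_a
    are read from view_a, and for half_b from view_b.
    """
    perm = mask_positions[torch.randperm(mask_positions.numel())]
    half_a, half_b = perm.chunk(2)

    view_a = clean_ids.clone()
    view_a[half_a] = mask_id          # half_a masked, half_b visible
    view_b = clean_ids.clone()
    view_b[half_b] = mask_id          # half_b masked, half_a visible
    return (view_a, half_a), (view_b, half_b)
```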
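Finally, one plausible reading of Reverse CALM (an assumption; the abstract gives no formula): where chunk-level likelihood matching would weight the chunk-wise log-ratio by the teacher's likelihoods, the reverse objective weights it by the student's, with each model scoring an aligned text chunk $c$ under its own tokenization ($\tau_S$ for the student, $\tau_T$ for the teacher):

$$
\mathcal{L}_{\text{Rev-CALM}}
  = \sum_{c \in \mathcal{C}(x)} q_\theta(c \mid x)\,
    \log \frac{q_\theta(c \mid x)}{p_T(c \mid x)},
\qquad
q_\theta(c \mid x) = \prod_{u \in \tau_S(c)} q_\theta(u \mid x),
\quad
p_T(c \mid x) = \prod_{v \in \tau_T(c)} p_T(v \mid x),
$$

where $\mathcal{C}(x)$ is the set of text chunks aligned across the two tokenizers. Weighting the log-ratio by the student's own probabilities $q_\theta$ is consistent with the abstract's claim of bounded gradients, and discarding chunks with negligible likelihood under either model is one way to read the dual-end noise filtering it describes.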