

Introspective Diffusion Language Models

April 13, 2026
作者: Yifan Yu, Yuqing Jian, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song, Tri Dao, Ben Athiwaratkun, James Zou, Fan Lai, Chenfeng Xu
cs.AI

Abstract

Diffusion language models (DLMs) promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, a metric that measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce the Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.
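The introspective acceptance rate described above can be illustrated with a minimal sketch: re-score each previously generated token under the model and count how often the model's own greedy prediction agrees with it. The `greedy_next` interface and the toy stand-in model below are hypothetical illustrations, not the paper's implementation; the actual DLM scoring procedure is defined in the paper itself.

```python
def introspective_acceptance_rate(greedy_next, prompt, generated):
    """Fraction of `generated` tokens that the model re-predicts when
    asked for the next token at each position (a proxy for whether the
    model 'accepts' its own prior generations)."""
    accepted = 0
    context = list(prompt)
    for token in generated:
        if greedy_next(context) == token:  # model agrees with its own output
            accepted += 1
        context.append(token)  # extend context with the generated token
    return accepted / len(generated) if generated else 1.0


# Toy stand-in "model": greedily predicts the next integer in a sequence.
toy_model = lambda ctx: ctx[-1] + 1

# Tokens 3, 4, and 11 are re-predicted from their contexts; 10 is not.
rate = introspective_acceptance_rate(toy_model, [1, 2], [3, 4, 10, 11])
print(rate)  # 0.75
```

An AR model scores each position with exactly the causal context used at generation time, so its acceptance rate is structurally high; a DLM re-scoring under a different masking pattern need not agree with itself, which is the gap the abstract identifies.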