ChatPaper.ai


Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs

March 8, 2026
Authors: Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli
cs.AI

Abstract

Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
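The two ideas in the abstract — profiling layer-wise representational redundancy, then exploiting it with a static, task-agnostic skip set at inference time — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the residual layers, probe input, similarity metric (adjacent-layer cosine similarity), and the 0.98 threshold are all assumptions chosen for the sketch.

```python
# Illustrative sketch (not the paper's code): profile adjacent-layer
# representational similarity, then statically skip near-redundant layers.
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 8

# Toy residual "layers": h + tanh(h @ W). Layers 2 and 3 are built to be
# near-identity (tiny weights), mimicking the early-layer redundancy the
# abstract reports for native diffusion models.
weights = [rng.normal(scale=0.001 if i in (2, 3) else 0.5, size=(d, d))
           for i in range(n_layers)]

def layer(h, w):
    return h + np.tanh(h @ w)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 1: layer-wise analysis -- how much does each layer rotate the
# hidden state on a probe input?
h = rng.normal(size=d)
sims = []
for w in weights:
    h_next = layer(h, w)
    sims.append(cosine(h, h_next))
    h = h_next

# Step 2: choose a fixed (static, task-agnostic) skip set offline:
# layers whose output is nearly collinear with their input.
skip = {i for i, s in enumerate(sims) if s > 0.98}

def forward(x, skip=frozenset()):
    # Inference with a static skip set: skipped layers become identity,
    # so no architectural change or KV-cache sharing is needed.
    h = x
    for i, w in enumerate(weights):
        if i not in skip:
            h = layer(h, w)
    return h

x = rng.normal(size=d)
full, pruned = forward(x), forward(x, skip)
# Skipping 2 of 8 layers cuts per-token layer FLOPs by 25%; the paper's
# 18.75% figure corresponds to the same idea at a different skip ratio.
print(f"skip={sorted(skip)}  output cosine={cosine(full, pruned):.3f}")
```

Because the skip set is chosen once, offline, it applies unchanged across tasks and composes with caching optimizations, matching the "cache-orthogonal" framing in the abstract.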
PDF · March 15, 2026