Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs
March 8, 2026
Authors: Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli
cs.AI
Abstract
Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
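The static layer-skipping idea can be illustrated with a minimal sketch: a fixed set of layer indices, chosen once ahead of inference, is bypassed via the residual path on every forward pass. The toy layers, the skip set, and the FLOPs counter below are all illustrative assumptions, not the paper's implementation; the only number taken from the abstract is the 18.75% reduction, which corresponds to skipping 6 of 32 transformer blocks.

```python
def make_toy_layers(n):
    # Each "layer" is a toy residual update on a list of floats,
    # standing in for a transformer block.
    return [lambda h, i=i: [x + 0.01 * (i + 1) for x in h] for i in range(n)]

def forward(hidden, layers, skip=frozenset()):
    """Run the stack, statically skipping the given layer indices.

    Skipped layers act as identity shortcuts: the incoming hidden
    state passes through unchanged, so no architectural change or
    KV-cache sharing is needed.
    """
    evaluated = 0  # proxy for per-layer FLOPs
    for i, layer in enumerate(layers):
        if i in skip:
            continue
        hidden = layer(hidden)
        evaluated += 1
    return hidden, evaluated

layers = make_toy_layers(32)
h0 = [1.0, 2.0]

_, full_cost = forward(h0, layers)
# Skipping 6 of 32 blocks removes 6/32 = 18.75% of layer
# evaluations, matching the FLOPs-reduction figure in the abstract.
# The choice of early-layer indices 2..7 is a hypothetical example,
# motivated by the reported early-layer redundancy in native dLLMs.
_, skip_cost = forward(h0, layers, skip=frozenset(range(2, 8)))
print(1 - skip_cost / full_cost)  # 0.1875
```

Because the skip set is fixed and task-agnostic, it composes with (is orthogonal to) caching optimizations, which is the "cache-orthogonal" claim in the abstract.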