TiDAR: Think in Diffusion, Talk in Autoregression
November 12, 2025
Authors: Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov
cs.AI
Abstract
Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality because their causal structure aligns naturally with language modeling. This raises a fundamental question: can we achieve a synergy of high throughput, higher GPU utilization, and AR-level quality? Existing methods fail to balance these two aspects effectively: they either prioritize AR by using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or apply some form of left-to-right (AR-like) decoding logic to diffusion, which still suffers from quality degradation and forfeits diffusion's potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively, all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at the 1.5B and 8B scales. Thanks to parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models such as Dream and LLaDA in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
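To make the "single forward pass with a structured attention mask" idea concrete, the following is a minimal illustrative sketch (not the paper's exact mask): verified tokens attend causally to each other, while a trailing block of draft positions attends to the full verified prefix and bidirectionally within itself, so diffusion-style drafting and AR-style verification can share one pass. The function name and the prefix/draft split are assumptions for illustration.

```python
import numpy as np

def tidar_style_mask(n_prefix: int, n_draft: int) -> np.ndarray:
    """Illustrative structured attention mask (True = attention allowed).

    Rows are query positions, columns are key positions. The first
    n_prefix positions are already-accepted tokens (causal attention);
    the last n_draft positions are diffusion draft slots that see the
    whole prefix and attend bidirectionally among themselves.
    """
    n = n_prefix + n_draft
    mask = np.zeros((n, n), dtype=bool)
    # Causal (lower-triangular) attention over the verified prefix.
    mask[:n_prefix, :n_prefix] = np.tril(np.ones((n_prefix, n_prefix), dtype=bool))
    # Draft positions attend to the entire verified prefix...
    mask[n_prefix:, :n_prefix] = True
    # ...and bidirectionally to each other (diffusion-style block).
    mask[n_prefix:, n_prefix:] = True
    return mask

m = tidar_style_mask(4, 3)
```

Because both regions live in one mask, one forward pass can score the prefix autoregressively (for verification) while simultaneously filling the draft block in parallel, which is the compute-density win the abstract describes.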