

TiDAR: Think in Diffusion, Talk in Autoregression

November 12, 2025
作者: Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov
cs.AI

Abstract

Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR-level quality? Existing methods fail to effectively balance these two aspects: they either prioritize AR quality by using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or impose some form of left-to-right (AR-like) decoding logic on diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and LLaDA in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
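The single-pass draft-and-verify idea hinges on the structured attention mask: committed prefix tokens attend causally, while the diffusion draft block attends bidirectionally to itself and to the full prefix. The abstract does not give the mask's exact construction, so the sketch below is only a rough illustration of that mask family; the function name, shapes, and block layout are assumptions, not taken from the paper.

```python
import numpy as np

def tidar_style_mask(n_prefix: int, n_draft: int) -> np.ndarray:
    """Illustrative structured attention mask (True = may attend).

    - Prefix tokens use standard causal (AR) self-attention.
    - Draft tokens attend to the whole prefix (already decided) and
      bidirectionally to each other, so they can be drafted in parallel.

    This is a hypothetical sketch, not the mask from the TiDAR paper.
    """
    n = n_prefix + n_draft
    mask = np.zeros((n, n), dtype=bool)
    # Causal block for the committed prefix.
    mask[:n_prefix, :n_prefix] = np.tril(np.ones((n_prefix, n_prefix), dtype=bool))
    # Draft tokens see the full prefix...
    mask[n_prefix:, :n_prefix] = True
    # ...and attend bidirectionally within the draft block.
    mask[n_prefix:, n_prefix:] = True
    return mask

m = tidar_style_mask(n_prefix=3, n_draft=2)
```

With one such mask, a single forward pass can score the prefix autoregressively (for verification) while the draft block is denoised in parallel, which is the source of the throughput gain the abstract reports.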
December 1, 2025