TiDAR: 拡散で思考し、自己回帰で語る

要旨

拡散言語モデルは高速な並列生成の可能性を秘めている一方、自己回帰（AR）モデルは因果構造が言語モデリングに自然に適合するため、一般に品質面で優れています。これにより根本的な疑問が生じます：高いスループット、高いGPU利用率、そしてARモデル並みの品質を両立する相乗効果は達成可能か？既存手法はこれら二つの側面を効果的に両立できておらず、ARモデルを優先して弱いモデルで逐次起草する（投機的デコーディング）ことで起草効率が低下するか、あるいは拡散モデルに何らかの左から右への（AR的な）デコーディングロジックを適用するものの、品質劣化が生じ、並列化の可能性を損なっています。本論文ではTiDARを提案します。これは、拡散モデルでトークンを起草（Thinking）し、自己回帰的に最終出力をサンプリング（Talking）するシーケンスレベルのハイブリッドアーキテクチャで、特別に設計された構造化アテンションマスクを用いて単一のフォワードパス内で処理を完結します。この設計はGPUの計算密度を最大限に活用し、起草能力と検証能力の強力なバランスを実現します。さらにTiDARは、スタンドアロンモデルとしてサービス運用に適した（低オーバーヘッドな）設計となっています。1.5Bおよび8Bスケールの生成タスクと尤度タスクにおいて、ARモデル、投機的デコーディング、拡散モデル変種に対してTiDARを詳細に評価しました。並列起草・サンプリングと正確なKVキャッシュサポートにより、TiDARは測定スループットで投機的デコーディングを上回り、DreamやLladaなどの拡散モデルを効率性と品質の両面で凌駕します。特に注目すべきは、TiDARがARモデルとの品質差を初めて解消しつつ、毎秒4.71倍から5.91倍ものトークンを生成できる点です。

English

Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.

TiDAR: 拡散で思考し、自己回帰で語る

TiDAR: Think in Diffusion, Talk in Autoregression

要旨

Support