Sink-Aware Pruning for Diffusion Language Models

Abstract

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, which motivates efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs, whereas prior work usually keeps sinks for AR LLMs. Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
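The variance measurement mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each denoising timestep yields a head-averaged attention matrix (rows are queries, columns are keys), that the "dominant sink" is the key position receiving the most total attention, and that instability is summarized as the population variance of that position across timesteps. The function names `dominant_sink_positions` and `sink_position_variance` are hypothetical.

```python
from statistics import pvariance


def dominant_sink_positions(attn_per_step):
    """Return, for each timestep, the key index that receives the most attention.

    attn_per_step: list over timesteps of L x L attention matrices
    (list of rows; rows are queries, columns are keys). Assumed to be
    already averaged over heads and layers.
    """
    positions = []
    for attn in attn_per_step:
        num_keys = len(attn[0])
        # Total attention mass received by each key position (column sums).
        received = [sum(row[k] for row in attn) for k in range(num_keys)]
        # The dominant sink is the position with the largest received mass.
        positions.append(max(range(num_keys), key=received.__getitem__))
    return positions


def sink_position_variance(attn_per_step):
    """Population variance of the dominant sink position across timesteps."""
    return pvariance(dominant_sink_positions(attn_per_step))


# Toy data: three timesteps, sequence length 3.
sink_at_0 = [[0.8, 0.1, 0.1]] * 3
sink_at_1 = [[0.1, 0.8, 0.1]] * 3
sink_at_2 = [[0.1, 0.1, 0.8]] * 3

stable_trajectory = [sink_at_0, sink_at_0, sink_at_0]      # AR-like: sink stays put
shifting_trajectory = [sink_at_0, sink_at_2, sink_at_1]    # DLM-like: sink moves

stable_var = sink_position_variance(stable_trajectory)
shifting_var = sink_position_variance(shifting_trajectory)
```

Under these assumptions, a trajectory whose sink never moves has zero variance, while a trajectory whose sink moves between timesteps has positive variance. The abstract's claim is that DLMs resemble the second case far more than AR models do.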
