Sink-Aware Pruning for Diffusion Language Models

Abstract

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, which motivates efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention-sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks in DLMs are often transient and less structurally essential than in AR models. Based on this observation, we propose **Sink-Aware Pruning**, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
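
The variance measurement described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names `dominant_sink` and `sink_position_variance`, the toy attention maps, and the choice to define the dominant sink as the key position receiving the most total attention are all assumptions made for illustration only.

```python
from statistics import pvariance


def dominant_sink(attn):
    """Return the key position that receives the most attention mass.

    `attn` is a single attention map given as a list of rows (one per query),
    each row holding the attention weights over key positions.
    Hypothetical helper: the definition of "dominant sink" is an assumption.
    """
    num_keys = len(attn[0])
    # Total attention each key position receives, summed over all queries.
    received = [sum(row[k] for row in attn) for k in range(num_keys)]
    return max(range(num_keys), key=received.__getitem__)


def sink_position_variance(attn_per_step):
    """Return the dominant sink position per timestep and its population variance."""
    positions = [dominant_sink(attn) for attn in attn_per_step]
    return positions, pvariance(positions)


# Toy data: three denoising steps over three tokens (made-up numbers).
stable = [[[0.8, 0.1, 0.1]] * 3] * 3          # sink stays at position 0
drifting = [
    [[0.8, 0.1, 0.1]] * 3,                    # sink at position 0
    [[0.1, 0.8, 0.1]] * 3,                    # sink moves to position 1
    [[0.1, 0.1, 0.8]] * 3,                    # sink moves to position 2
]

print(sink_position_variance(stable))
print(sink_position_variance(drifting))
```

Under these assumptions, a sink that stays at one position yields zero variance, while a sink that moves between timesteps yields a positive variance; the abstract reports that DLM sinks behave more like the second case.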
