Sink-Aware Pruning for Diffusion Language Models

Abstract

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose **Sink-Aware Pruning**, which automatically identifies and prunes unstable sinks in DLMs, whereas prior studies usually keep sinks for AR LLMs. Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
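
The abstract describes sink stability in terms of how the dominant sink location shifts across timesteps. As a rough illustration only, the sketch below takes one attention matrix per denoising timestep, picks the key position that receives the most total attention, and reports the variance of that position over the trajectory. The input layout, the function names, and the use of a plain population variance are assumptions made for this example; they are not taken from the paper or its repository, whose actual metric may differ.

```python
from statistics import pvariance


def dominant_sink_positions(attn_per_step):
    """Return, for each timestep, the key position receiving the most attention.

    attn_per_step is a hypothetical input: a list with one attention matrix
    per denoising timestep, each given as rows[query][key] of weights.
    """
    positions = []
    for attn in attn_per_step:
        num_keys = len(attn[0])
        # Total attention mass received by each key position (column sums).
        received = [sum(row[k] for row in attn) for k in range(num_keys)]
        positions.append(max(range(num_keys), key=received.__getitem__))
    return positions


def sink_position_variance(attn_per_step):
    """Population variance of the dominant sink position across timesteps."""
    return pvariance(dominant_sink_positions(attn_per_step))


# Toy data (invented for illustration): the sink stays at position 0 ...
stable = [[[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]] * 3
# ... versus a sink that moves to a new position at every timestep.
drifting = [
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
    [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
    [[0.1, 0.1, 0.8], [0.1, 0.2, 0.7]],
]

print(dominant_sink_positions(stable), sink_position_variance(stable))
print(dominant_sink_positions(drifting), sink_position_variance(drifting))
```

Under this toy measure, a trajectory whose dominant sink never moves has zero variance, while a drifting sink yields a positive value; a pruning rule could in principle use such a score to decide which sink tokens are stable enough to keep.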