Sink-Aware Pruning for Diffusion Language Models

Abstract

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, which motivates efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs, whereas prior work usually keeps sinks for AR LLMs. Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
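The abstract describes sink stability as being measured by how the dominant sink location shifts across timesteps. The following is a minimal illustrative sketch of one way such a statistic could be computed, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the attention tensor layout (timesteps, heads, queries, keys), the choice of averaging over heads and queries, and the use of plain variance of the argmax position.

```python
import numpy as np


def make_toy_attention(sink_positions, n=8):
    """Build a toy attention tensor of shape (T, 1, n, n).

    Hypothetical helper for demonstration only: at timestep t every query
    puts extra weight on key position sink_positions[t].
    """
    t_steps = len(sink_positions)
    attn = np.ones((t_steps, 1, n, n), dtype=float)
    for t, pos in enumerate(sink_positions):
        attn[t, 0, :, pos] += n  # boost the chosen key column
    # Normalize each query row so it sums to 1, like softmax output.
    return attn / attn.sum(axis=-1, keepdims=True)


def dominant_sink_positions(attn):
    """Return the key index receiving the most attention at each timestep.

    attn is assumed to have shape (T, H, N, N): timesteps, heads,
    query positions, key positions.
    """
    attn = np.asarray(attn, dtype=float)
    incoming = attn.mean(axis=(1, 2))  # average over heads and queries -> (T, N)
    return incoming.argmax(axis=1)     # dominant key position per timestep


def sink_position_variance(attn):
    """Variance of the dominant sink position across timesteps (assumed metric)."""
    return float(np.var(dominant_sink_positions(attn)))


if __name__ == "__main__":
    stable = make_toy_attention([0, 0, 0, 0])    # AR-like: sink stays at position 0
    drifting = make_toy_attention([0, 3, 5, 7])  # DLM-like: sink moves over time
    print("stable variance:  ", sink_position_variance(stable))
    print("drifting variance:", sink_position_variance(drifting))
```

Under these assumptions, a sink that stays at one position yields zero variance, while a sink that moves between timesteps yields a positive value. A pruning criterion in the spirit of the abstract would treat high-variance sinks as candidates for removal; the actual criterion is defined in the paper and repository.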
