Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Abstract

Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, which motivates selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10× higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in both efficiency and effectiveness.
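To make the idea of attention-guided eviction concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the scoring rule or the retention budget, so the mean-attention saliency score, the `keep_ratio` parameter, and the function names `select_tokens_to_keep` and `evict` are all assumptions chosen for illustration. The delayed update and the separate handling of prefix and suffix entries are not modeled.

```python
from typing import List


def select_tokens_to_keep(attn: List[List[float]], keep_ratio: float) -> List[int]:
    """Return indices of cached tokens to retain, in original order.

    attn[q][t] is the attention weight from query q (a token in the block
    being decoded) to cached token t (a prefix or suffix entry).
    """
    if not attn or not attn[0]:
        return []
    num_tokens = len(attn[0])
    # Assumed saliency score: average attention each cached token receives.
    saliency = [sum(row[t] for row in attn) / len(attn) for t in range(num_tokens)]
    # Assumed budget: keep a fixed fraction of tokens, but at least one.
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Rank by saliency (ties broken by lower index), then restore positional order.
    ranked = sorted(range(num_tokens), key=lambda t: (-saliency[t], t))
    return sorted(ranked[:num_keep])


def evict(cache: List[str], keep: List[int]) -> List[str]:
    """Drop every cache entry whose index is not in `keep`."""
    return [cache[i] for i in keep]


# Two queries attending to five cached tokens (toy numbers).
attn = [
    [0.50, 0.05, 0.30, 0.05, 0.10],
    [0.40, 0.10, 0.35, 0.05, 0.10],
]
keep = select_tokens_to_keep(attn, keep_ratio=0.5)
pruned = evict(["k0", "k1", "k2", "k3", "k4"], keep)
```

Because the abstract reports that token saliency is stable across decoding steps, a selection like this could plausibly be reused for several steps rather than recomputed each time, which is where the memory and throughput savings would come from.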
