d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Abstract

Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and therefore cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d^2Cache), a training-free approximate KV cache framework for accelerating dLLM inference. d^2Cache features a two-stage fine-grained selection strategy that identifies tokens and adaptively updates their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d^2Cache naturally offers a more reliable decoding alternative, which enables quasi left-to-right generation and mitigates premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d^2Cache not only achieves substantial inference speedups but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
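To make the caching idea concrete, the following is a minimal sketch of selective KV refresh: at each decoding step only a chosen subset of positions has its KV state recomputed, and every other position reuses its cached entry. It is not the authors' implementation. The abstract does not specify the selection criteria, so `select_tokens`, its `scores` and `budget` arguments, and the toy `project_kv` projection are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
KV = Tuple[Vector, Vector]  # (key, value) pair for one token position


def select_tokens(scores: Sequence[float], budget: int) -> List[int]:
    """Hypothetical selector: keep the `budget` highest-scoring positions.

    The real selection strategy is not described in the abstract; a plain
    top-k over an arbitrary per-token score is assumed here.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def refresh_cache(
    cache: Dict[int, KV],
    hidden: Sequence[Vector],
    selected: Sequence[int],
    project_kv: Callable[[Vector], KV],
) -> int:
    """Recompute KV for selected (or not yet cached) positions, reuse the rest.

    Returns the number of positions whose KV state was recomputed.
    """
    chosen = set(selected)
    recomputed = 0
    for pos, h in enumerate(hidden):
        if pos in chosen or pos not in cache:
            cache[pos] = project_kv(h)
            recomputed += 1
    return recomputed


def toy_project_kv(h: Vector) -> KV:
    """Toy projection (assumption): key = 2 * h, value = h + 1."""
    return [2.0 * x for x in h], [x + 1.0 for x in h]


# Step 0: the cache is empty, so every position is computed once.
cache: Dict[int, KV] = {}
hidden_step0 = [[1.0], [2.0], [3.0], [4.0]]
first_pass = refresh_cache(cache, hidden_step0, [], toy_project_kv)

# Step 1: hidden states at positions 0 and 2 changed; refresh only those.
hidden_step1 = [[1.5], [2.0], [3.5], [4.0]]
selected = select_tokens([0.9, 0.1, 0.8, 0.2], budget=2)
second_pass = refresh_cache(cache, hidden_step1, selected, toy_project_kv)
```

In this sketch the second pass recomputes two of the four positions and leaves the other two cache entries untouched. The cache is approximate in the sense that an unselected position keeps a stale KV state if its hidden state has drifted since it was last refreshed.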
