dKV-Cache: The Cache for Diffusion Language Models

Abstract

Diffusion Language Models (DLMs) have been seen as a promising competitor to autoregressive language models. However, they have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache (dKV-Cache), for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants that cache keys and values step by step: (1) dKV-Cache-Decode, which provides almost lossless acceleration and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference; and (2) dKV-Cache-Greedy, which caches aggressively with a reduced cache lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. Overall, dKV-Cache achieves a 2-10x inference speedup, largely narrowing the gap between autoregressive models (ARs) and DLMs. We evaluate dKV-Cache on several benchmarks and observe acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that a cache can also be used in DLMs, even in a training-free manner with current DLMs.
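
To make the idea of delayed caching concrete, the following toy sketch tracks only the bookkeeping: which positions have their key/value states recomputed at each denoising step and which are served from a cache. It is an illustration under stated assumptions, not the paper's implementation. In particular, the class `DelayedKVCache`, the `delay` parameter, the helper `toy_kv`, and the rule "a position becomes cacheable `delay` steps after it is decoded" are hypothetical choices made here to model one plausible reading of "delayed and conditioned caching"; the abstract does not spell out the exact rule. Real key/value tensors are replaced by small tuples.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Toy stand-in for a (key, value) pair: (tag, position, step at which it was computed).
KV = Tuple[str, int, int]


@dataclass
class DelayedKVCache:
    """Hypothetical bookkeeping for delayed KV caching (assumption, not the paper's code)."""

    delay: int = 1  # assumed: steps to wait after a token is decoded before caching it
    decoded_at: Dict[int, int] = field(default_factory=dict)  # position -> decode step
    store: Dict[int, KV] = field(default_factory=dict)  # position -> cached (key, value)

    def mark_decoded(self, pos: int, step: int) -> None:
        # Record the first step at which this position was decoded (unmasked).
        self.decoded_at.setdefault(pos, step)

    def forward(
        self, step: int, seq_len: int, compute_kv: Callable[[int, int], KV]
    ) -> Tuple[Dict[int, KV], List[int]]:
        """Return the K/V used at this step and the positions that were recomputed."""
        kv: Dict[int, KV] = {}
        recomputed: List[int] = []
        for pos in range(seq_len):
            if pos in self.store:
                kv[pos] = self.store[pos]  # reuse cached state
                continue
            kv[pos] = compute_kv(pos, step)  # still changing: recompute
            recomputed.append(pos)
            # Cache only once the token has been decoded for at least `delay` steps.
            if pos in self.decoded_at and step - self.decoded_at[pos] >= self.delay:
                self.store[pos] = kv[pos]
        return kv, recomputed


def toy_kv(pos: int, step: int) -> KV:
    # Placeholder for the model's key/value projection at a given position and step.
    return ("kv", pos, step)


def simulate(order: List[int], delay: int = 1) -> Tuple[List[int], DelayedKVCache]:
    """Decode one position per step in `order`; count recomputed positions per step."""
    cache = DelayedKVCache(delay=delay)
    counts: List[int] = []
    for step, pos in enumerate(order):
        _, recomputed = cache.forward(step, len(order), toy_kv)
        counts.append(len(recomputed))
        cache.mark_decoded(pos, step)  # this position is decoded at the end of the step
    return counts, cache


if __name__ == "__main__":
    counts, cache = simulate([2, 0, 3, 1])
    print(counts)
    print(sorted(cache.store))
```

With the decode order `[2, 0, 3, 1]` and `delay=1`, the per-step recompute counts are `[4, 4, 3, 2]`: a position decoded at step t is recomputed once more at step t+1, when its state reflects the decoded token, and is reused from step t+2 onward. A plain bidirectional forward pass would recompute all four positions at every step.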
