d^2Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Abstract

Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and therefore cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d^2Cache), a training-free approximate KV cache framework for accelerating dLLM inference. d^2Cache features a two-stage fine-grained selection strategy that identifies tokens and adaptively updates their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d^2Cache naturally offers a more reliable decoding alternative, which enables quasi left-to-right generation and mitigates premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d^2Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
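The general idea of refreshing the KV states of only a selected subset of tokens while reusing cached states for the rest can be illustrated with a minimal sketch. This is not the d^2Cache implementation: the abstract does not specify the two-stage selection rule, so `select_top_k` and its per-token `scores` input are placeholder assumptions, and `SelectiveKVCache`, `project`, `fill`, and `refresh` are hypothetical names introduced only for illustration.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def project(hidden: Sequence[float], weight: Matrix) -> Vector:
    """Multiply a hidden-state row vector by a (d_in x d_out) weight matrix."""
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*weight)]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Placeholder selection rule (an assumption, not the paper's strategy):
    return the indices of the k highest-scoring tokens, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


class SelectiveKVCache:
    """Toy single-layer KV cache that recomputes only selected positions."""

    def __init__(self, w_k: Matrix, w_v: Matrix) -> None:
        self.w_k = w_k
        self.w_v = w_v
        self.keys: List[Vector] = []
        self.values: List[Vector] = []
        self.recomputed = 0  # number of token positions projected so far

    def fill(self, hidden_states: Sequence[Sequence[float]]) -> None:
        """Compute and store KV states for every position (first step)."""
        self.keys = [project(h, self.w_k) for h in hidden_states]
        self.values = [project(h, self.w_v) for h in hidden_states]
        self.recomputed += len(hidden_states)

    def refresh(self, hidden_states: Sequence[Sequence[float]],
                selected: Sequence[int]) -> None:
        """Recompute KV states only at the selected positions.

        All other positions keep their cached (possibly stale) states,
        which is what makes the cache approximate.
        """
        for i in selected:
            self.keys[i] = project(hidden_states[i], self.w_k)
            self.values[i] = project(hidden_states[i], self.w_v)
        self.recomputed += len(selected)
```

In this sketch a decoding step that refreshes k of n positions performs k projections instead of n. The trade-off is that the untouched positions carry states computed from earlier hidden states, so the quality of the selection rule determines how much accuracy is lost.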
