Residual Context Diffusion Language Models

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling the computation spent on the discarded tokens is beneficial, as these tokens retain contextual information that is useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction-following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~1 billion tokens. RCD consistently improves frontier dLLMs by 5-10 points in accuracy across a wide range of benchmarks, with minimal additional computational overhead. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and requires up to 4-5x fewer denoising steps at equivalent accuracy levels.
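The remasking step and the idea of recycling discarded positions can be illustrated with a toy sketch. The abstract does not say how RCD computes its contextual residuals, so everything below is an assumption made for illustration: the function name `remask_with_residuals`, the tiny embedding table, and in particular the choice of a probability-weighted mix of token embeddings as the residual are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def remask_with_residuals(
    probs: List[List[float]],     # per-position distribution over the vocabulary
    tokens: List[Optional[int]],  # None marks a position that is still masked
    embeddings: List[Vector],     # toy embedding table, one row per vocabulary id
    k: int,                       # number of masked positions to commit this step
) -> Tuple[List[Optional[int]], List[Optional[Vector]]]:
    """One toy decoding step: commit the k most confident masked positions
    and keep a residual vector for every position that gets remasked."""
    masked = [i for i, t in enumerate(tokens) if t is None]
    # Rank masked positions by the confidence of their top prediction.
    ranked = sorted(masked, key=lambda i: max(probs[i]), reverse=True)
    commit = set(ranked[:k])

    new_tokens = list(tokens)
    residuals: List[Optional[Vector]] = [None] * len(tokens)
    dim = len(embeddings[0])
    for i in masked:
        if i in commit:
            # Standard remasking: decode the argmax token at confident positions.
            new_tokens[i] = probs[i].index(max(probs[i]))
        else:
            # ASSUMPTION (not from the paper): instead of discarding the
            # prediction, keep its expected embedding as a residual that a
            # model could add to the mask embedding at the next step.
            residuals[i] = [
                sum(p * embeddings[v][d] for v, p in enumerate(probs[i]))
                for d in range(dim)
            ]
    return new_tokens, residuals


# Toy example: vocabulary of 3 tokens, 2-dimensional embeddings, 3 masked positions.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [
    [0.9, 0.05, 0.05],  # most confident position, committed when k=1
    [0.5, 0.5, 0.0],    # remasked, residual kept
    [0.1, 0.2, 0.7],    # remasked, residual kept
]
tokens: List[Optional[int]] = [None, None, None]
new_tokens, residuals = remask_with_residuals(probs, tokens, embeddings, k=1)
print(new_tokens)
print(residuals)
```

In plain remasking, the predictions at the two low-confidence positions would simply be thrown away; here they survive as vectors that the next denoising step could consume. How the actual RCD module builds and injects its residuals may differ substantially from this sketch.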
