Triplet-Block Diffusion RWKV

Abstract

Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. Linear-time causal models and discrete diffusion models each address one of these weaknesses, but combining them is inherently inconsistent: diffusion requires bidirectional attention, while causal models are unidirectional. To unify these architectures, we propose B³D-RWKV, a diffusion RWKV variant that combines RWKV's O(L) inference efficiency with parallel, bidirectional discrete diffusion through a triplet-block layout method. B³D-RWKV-7.2B reaches accuracy comparable to existing models on an 8-task suite while significantly outperforming baselines in decoding throughput, with an average speedup of 1.6×.