Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Abstract

We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding is a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to integrate seamlessly into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.
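The abstract does not spell out the consensus mechanism, so the following is only a minimal sketch of one way an exact draft-and-verify rule could make parallel proposals lossless, assuming greedy decoding. All names (`consensus_step`, `generate`, `toy_ar_next`, `toy_propose`) and the toy token functions are hypothetical illustrations, not part of Orthrus. The diffusion view is modeled as a function that proposes a block of tokens, and the autoregressive view as a next-token function that accepts the longest agreeing prefix and supplies one correction token at the first disagreement.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-ins for the two views (assumptions, not the Orthrus API):
#   NextToken    : autoregressive view, greedy next token for a context
#   ProposeBlock : diffusion view, proposes `block_size` tokens at once
NextToken = Callable[[Sequence[int]], int]
ProposeBlock = Callable[[Sequence[int], int], List[int]]


def consensus_step(prefix: Sequence[int], ar_next: NextToken,
                   propose: ProposeBlock, block_size: int) -> List[int]:
    """Accept drafted tokens only while they match the autoregressive view."""
    assert block_size >= 1
    draft = propose(prefix, block_size)
    accepted: List[int] = []
    for tok in draft:
        # A real system would verify the whole block in one batched pass;
        # the check is done token by token here purely for clarity.
        target = ar_next(list(prefix) + accepted)
        if tok != target:
            accepted.append(target)  # correction token from the AR view
            return accepted
        accepted.append(tok)
    return accepted


def generate(prompt: Sequence[int], ar_next: NextToken, propose: ProposeBlock,
             block_size: int, max_new_tokens: int) -> List[int]:
    """Repeat consensus steps until enough new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(consensus_step(out, ar_next, propose, block_size))
    return out[:len(prompt) + max_new_tokens]


def ar_greedy(prompt: Sequence[int], ar_next: NextToken,
              max_new_tokens: int) -> List[int]:
    """Plain sequential greedy decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(ar_next(out))
    return out


def toy_ar_next(ctx: Sequence[int]) -> int:
    # Toy deterministic "model" over a 50-token vocabulary.
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_propose(ctx: Sequence[int], block_size: int) -> List[int]:
    # Toy drafter: agrees with the AR view except at every third position.
    cur, draft = list(ctx), []
    for _ in range(block_size):
        tok = toy_ar_next(cur)
        if len(cur) % 3 == 0:
            tok = (tok + 1) % 50  # deliberately wrong proposal
        draft.append(tok)
        cur.append(tok)
    return draft


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = generate(prompt, toy_ar_next, toy_propose, 4, 20)
    print(fast == ar_greedy(prompt, toy_ar_next, 20))
```

Under this acceptance rule every emitted token equals what sequential greedy decoding would have produced, which is the sense of "lossless" assumed here; any speedup in practice depends on how often the drafted block agrees with the autoregressive view.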