OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, which hinders real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. Naively applying causal distillation to such dual-stream architectures, however, triggers severe training instability because of the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at approximately 25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher.

Project Page: https://omniforcing.com
