Lip Forcing: Few-Step Autoregressive Diffusion for Real-Time Lip Synchronization

Abstract

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time classifier-free guidance (CFG), enabling real-time lip synchronization. A lip-sync-specific analysis of the teacher's denoising trajectory reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student reaches real-time streaming at 31 FPS, 17.6× faster than its same-scale bidirectional counterpart. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs 39.8× faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.
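The chunk-wise, two-step, CFG-free inference described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the causal student, the timestep values `(1.0, 0.5)` are placeholders, and the linear interpolation between clean sample and noise is an assumed flow-matching-style convention that the abstract does not specify.

```python
import numpy as np

def toy_denoiser(x_t, t, audio, context):
    """Hypothetical stand-in for the causal student: predicts a clean chunk.

    A real student would be a video diffusion network; this toy version just
    pulls the noisy chunk toward the audio features and, when available,
    toward the last frame of the previously generated chunk.
    """
    target = audio if context is None else 0.5 * (audio + context[-1])
    return (1.0 - t) * x_t + t * target

def generate_stream(audio_chunks, timesteps=(1.0, 0.5), seed=0):
    """Generate chunks one at a time, each with len(timesteps) denoising steps.

    Every step uses a single conditional forward pass (no CFG, so no second
    unconditional pass). Returns the stacked chunks and the number of
    forward passes performed.
    """
    rng = np.random.default_rng(seed)
    context = None          # previously generated chunk (causal conditioning)
    outputs = []
    forward_passes = 0
    for audio in audio_chunks:
        x = rng.standard_normal(audio.shape)      # start each chunk from noise
        for i, t in enumerate(timesteps):
            x0 = toy_denoiser(x, t, audio, context)
            forward_passes += 1
            if i + 1 < len(timesteps):
                # Re-noise the clean prediction to the next (lower) noise level.
                t_next = timesteps[i + 1]
                noise = rng.standard_normal(audio.shape)
                x = (1.0 - t_next) * x0 + t_next * noise
            else:
                x = x0                            # final step: keep clean output
        context = x
        outputs.append(x)
    return np.stack(outputs), forward_passes

# Three chunks of 4 frames with 8 features each (placeholder shapes).
audio = [np.ones((4, 8)) * k for k in range(3)]
video, calls = generate_stream(audio)
print(video.shape, calls)
```

Under these assumptions the cost per chunk is two forward passes, whereas a CFG-guided sampler would typically need two passes per step. Because each chunk depends only on earlier chunks, the first chunk can be emitted before later audio arrives, which is the property that makes streaming possible.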