Lip Forcing: Few-Step Autoregressive Diffusion for Real-Time Lip Synchronization

Abstract

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but their full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B-parameter audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps and without inference-time classifier-free guidance (CFG), enabling real-time lip synchronization. A lip-sync-specific analysis of teacher trajectories reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student reaches real-time streaming at 31 FPS, 17.6× faster than the bidirectional model of the same scale. The 14B student, the largest diffusion model reported to date for V2V lip synchronization, runs 39.8× faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.
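The chunk-wise, two-step generation described above can be illustrated with a minimal sketch. It is not the authors' implementation. The names `toy_denoiser` and `stream_chunks`, the timestep values `(1.0, 0.5)`, the chunk length, and the linear re-noising rule between steps are all assumptions made for illustration. The toy denoiser stands in for the causal student and simply echoes a scalar "audio" value.

```python
import random
from typing import Callable, Iterator, List, Sequence

Chunk = List[float]  # one chunk of "frames", reduced to scalars for illustration
Denoiser = Callable[[Chunk, float, List[Chunk], float], Chunk]


def toy_denoiser(noisy: Chunk, audio: float, context: List[Chunk], t: float) -> Chunk:
    """Hypothetical stand-in for the causal student: predicts a clean chunk.

    A real student would attend to `context` (previously generated chunks)
    and to audio features; this toy echoes the audio value for every frame.
    """
    return [audio] * len(noisy)


def stream_chunks(
    audio_chunks: Sequence[float],
    denoiser: Denoiser,
    timesteps: Sequence[float] = (1.0, 0.5),  # assumed two-step schedule
    chunk_len: int = 4,
    seed: int = 0,
) -> Iterator[Chunk]:
    """Generate chunks one at a time, each with len(timesteps) denoiser calls."""
    rng = random.Random(seed)
    context: List[Chunk] = []  # causal history: only past chunks are visible
    for audio in audio_chunks:
        # Each chunk starts from pure noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]
        for i, t in enumerate(timesteps):
            x0 = denoiser(x, audio, context, t)  # single forward pass, no CFG
            if i + 1 < len(timesteps):
                # Assumed rule: re-noise the clean prediction to the next level.
                t_next = timesteps[i + 1]
                x = [(1.0 - t_next) * v + t_next * rng.gauss(0.0, 1.0) for v in x0]
            else:
                x = x0
        context.append(x)
        yield x  # the chunk can be emitted before later audio arrives
```

Because every chunk depends only on earlier chunks, output can be streamed as soon as the first chunk finishes its two denoiser calls. A full-sequence bidirectional model cannot do this, since it must denoise the entire sequence before emitting any frame.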