Lip Forcing: Few-Step Autoregressive Diffusion for Real-Time Lip Synchronization

Abstract

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B-parameter audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps and need no inference-time classifier-free guidance (CFG), which enables real-time lip synchronization. A lip-sync-specific analysis of teacher trajectories reveals a CFG tradeoff between fidelity and synchronization: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student reaches real-time streaming at 31 FPS, 17.6× faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs 39.8× faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.
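The inference pattern described above (chunks generated in order, each with two denoiser calls and no CFG pass) can be illustrated with a minimal sketch. This is not the authors' implementation: `generate_stream`, `toy_denoise`, the `Chunk` type, and the timestep values `(1.0, 0.5)` are all hypothetical, the denoiser is a placeholder for a causal student model, and the reference-video conditioning used in V2V lip synchronization is omitted for brevity.

```python
import random
from typing import Callable, List, Sequence

Chunk = List[float]  # toy stand-in for one chunk of video latents
Denoiser = Callable[[Chunk, float, Chunk, List[Chunk]], Chunk]


def generate_stream(
    audio_chunks: Sequence[Chunk],
    denoise: Denoiser,
    timesteps: Sequence[float] = (1.0, 0.5),  # assumed values, not from the paper
    seed: int = 0,
) -> List[Chunk]:
    """Generate chunks in order, calling the denoiser once per timestep."""
    rng = random.Random(seed)
    history: List[Chunk] = []  # earlier outputs only, so the context is causal
    for audio in audio_chunks:
        x = [rng.gauss(0.0, 1.0) for _ in audio]  # start each chunk from noise
        for t in timesteps:
            # Single conditional call per step: no extra unconditional CFG pass.
            x = denoise(x, t, audio, history)
        history.append(x)
    return history


def toy_denoise(x: Chunk, t: float, audio: Chunk, history: List[Chunk]) -> Chunk:
    """Placeholder model: pulls the sample toward the audio signal."""
    return [xi + (1.0 - t / 2.0) * (ai - xi) for xi, ai in zip(x, audio)]


frames = generate_stream([[0.1, 0.2], [0.3, 0.4]], toy_denoise)
print(len(frames))  # number of generated chunks
```

In this sketch the per-chunk cost is fixed at `len(timesteps)` denoiser calls, and each chunk can be emitted as soon as its inner loop finishes, which is the property that makes a streaming setup possible in principle.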