Lip Forcing: Few-Step Autoregressive Diffusion for Real-Time Lip Synchronization

Abstract

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps and without classifier-free guidance (CFG), enabling real-time lip synchronization. A lip-sync-specific analysis of teacher trajectories reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student reaches real-time streaming at 31 FPS, 17.6× faster than the bidirectional model of the same scale. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs 39.8× faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.
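The inference pattern described in the abstract (chunk-by-chunk generation, two denoising steps per chunk, one conditional forward pass per step with no CFG) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: the names `generate_stream` and `toy_denoiser`, the timesteps `(1.0, 0.5)`, the linear re-noising rule, and the use of plain float lists in place of video latents.

```python
import random
from typing import Callable, List, Sequence

Chunk = List[float]  # toy stand-in for one chunk of video latents
Denoiser = Callable[[Chunk, float, Chunk, List[Chunk]], Chunk]


def generate_stream(
    denoise: Denoiser,
    audio_chunks: Sequence[Chunk],
    timesteps: Sequence[float] = (1.0, 0.5),  # assumed two-step schedule
    seed: int = 0,
) -> List[Chunk]:
    """Generate chunks one at a time, each conditioned on earlier chunks only."""
    rng = random.Random(seed)
    history: List[Chunk] = []  # previously generated chunks (causal context)
    for audio in audio_chunks:
        # Start each chunk from pure Gaussian noise (t = 1).
        x = [rng.gauss(0.0, 1.0) for _ in audio]
        for i, t in enumerate(timesteps):
            # One conditional forward pass per step: no CFG, so no second
            # unconditional pass and no guidance-scale mixing.
            x0 = denoise(x, t, audio, history)
            if i + 1 < len(timesteps):
                # Re-noise the clean estimate to the next (lower) noise level.
                t_next = timesteps[i + 1]
                x = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in x0]
            else:
                x = x0  # final step: keep the clean estimate
        history.append(x)
    return history


def toy_denoiser(noisy: Chunk, t: float, audio: Chunk, history: List[Chunk]) -> Chunk:
    """Hypothetical stand-in for the student network (ignores the noisy input)."""
    prev = history[-1] if history else [0.0] * len(audio)
    return [0.5 * a + 0.5 * p for a, p in zip(audio, prev)]


if __name__ == "__main__":
    frames = generate_stream(toy_denoiser, [[1.0, 1.0], [0.0, 0.0]])
    print(frames)
```

In this sketch the cost per chunk is exactly `len(timesteps)` network calls. A CFG-guided sampler would need two calls per step (conditional and unconditional), which is one reason dropping inference-time CFG helps latency.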