Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Abstract

We propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model that is trained jointly on paired data. For streaming generation, existing distillation pipelines typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. We ask instead whether a native fast causal audio-video model can be trained directly. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency. Because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than from a fixed teacher. Experiments show that Mutual Forcing, using only 4 to 8 sampling steps, matches or surpasses strong baselines that require around 50 sampling steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
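
As a rough sense of scale for the step counts quoted above, the sketch below counts denoiser forward passes for an autoregressive rollout at 50, 8, and 4 sampling steps per chunk. It is back-of-the-envelope arithmetic, not a measurement from the paper: it assumes every sampling step costs one forward pass of equal cost, ignores all other overhead (encoding, decoding, context caching), and uses a hypothetical rollout length of 100 chunks. The function names are illustrative only.

```python
def network_evaluations(num_chunks: int, steps_per_chunk: int) -> int:
    """Count denoiser forward passes, assuming one pass per sampling step."""
    if num_chunks < 0 or steps_per_chunk <= 0:
        raise ValueError("num_chunks must be >= 0 and steps_per_chunk must be > 0")
    return num_chunks * steps_per_chunk


def step_reduction(baseline_steps: int, fast_steps: int) -> float:
    """Ratio of baseline sampling steps to few-step sampling steps."""
    return baseline_steps / fast_steps


BASELINE_STEPS = 50       # "around 50 sampling steps" in the abstract
FEW_STEP_OPTIONS = (4, 8) # "4 to 8 steps" in the abstract
NUM_CHUNKS = 100          # hypothetical rollout length, chosen for illustration

baseline_evals = network_evaluations(NUM_CHUNKS, BASELINE_STEPS)
few_step_evals = {k: network_evaluations(NUM_CHUNKS, k) for k in FEW_STEP_OPTIONS}
reductions = {k: step_reduction(BASELINE_STEPS, k) for k in FEW_STEP_OPTIONS}

print(f"baseline: {baseline_evals} forward passes")
for steps in FEW_STEP_OPTIONS:
    print(f"{steps} steps: {few_step_evals[steps]} forward passes, "
          f"{reductions[steps]:.2f}x fewer than the baseline")
```

Under these assumptions, the reduction in forward passes is 12.5x at 4 steps and 6.25x at 8 steps. Actual wall-clock speedups depend on model size, chunk length, and implementation details that the abstract does not specify.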
