Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
April 28, 2026
Authors: Yupeng Zhou, Lianghua Huang, Zhifan Wu, Jiabao Wang, Yupeng Shi, Biao Jiang, Daquan Zhou, Yu Liu, Ming-Ming Cheng, Qibin Hou
cs.AI
Abstract
In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than from a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
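The core mechanism described above, one weight-shared model operated in a multi-step (teacher) mode and a few-step (student) mode, with the few-step mode producing the historical context used during training, can be illustrated with a deliberately tiny sketch. Everything here is hypothetical: the paper's actual architecture is a causal audio-video diffusion model, whereas this uses a toy one-parameter scalar "denoiser" purely to show the shared-weight, dual-mode training loop.

```python
# Purely illustrative sketch of a Mutual-Forcing-style loop (NOT the
# paper's implementation): one shared parameter drives both a multi-step
# "teacher" mode and a few-step "student" mode.
import random

random.seed(0)

class SharedDenoiser:
    """Toy scalar 'denoiser': each step moves x toward a learned value."""
    def __init__(self):
        self.theta = 0.0  # single weight shared by both sampling modes

    def step(self, x):
        # One denoising step: halve the distance from x to theta.
        return x + 0.5 * (self.theta - x)

    def sample(self, x0, n_steps):
        x = x0
        for _ in range(n_steps):
            x = self.step(x)
        return x

model = SharedDenoiser()
data_target = 3.0  # stands in for real paired training data
lr = 0.1

for _ in range(200):
    noise = random.gauss(0.0, 1.0)
    # Few-step mode generates the historical context used in training,
    # matching how the model is actually conditioned at inference time.
    context = model.sample(noise, n_steps=4)
    # Same weights, many steps: the self-distillation teacher.
    teacher = model.sample(context, n_steps=16)
    # Same weights, few steps: the fast mode used at inference.
    student = model.sample(context, n_steps=4)
    # Crude surrogate gradient: pull the student toward the teacher
    # (self-distillation) and the teacher toward real data.
    grad = (student - teacher) + (teacher - data_target)
    model.theta -= lr * grad

print(round(model.theta, 2))  # theta ends up near the data target
```

Because `theta` is shared, improving the teacher's agreement with data also improves the student, and conditioning on few-step-generated `context` keeps training consistent with inference; this mirrors, in miniature, the mutual reinforcement the abstract describes.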