Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Abstract

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model that is trained jointly on paired data. For streaming generation, existing distillation pipelines typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages; we instead ask whether a native fast causal audio-video model can be trained directly. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates the historical context during training, which improves training-inference consistency. Because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than from a fixed teacher. Experiments show that Mutual Forcing, using only 4 to 8 sampling steps, matches or surpasses strong baselines that require around 50 sampling steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
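The interplay between the two modes can be illustrated with a deliberately tiny toy. The sketch below is not the authors' implementation: scalars stand in for audio-video chunks, a single number `w` stands in for the shared weights, and every name (`denoise`, `sample`, `rollout_history`, `dual_mode_losses`) is hypothetical. It only shows the control flow the abstract describes, under those assumptions: the few-step mode produces the history context, the multi-step mode of the same weights provides a self-distillation target, and a separate loss term uses real data.

```python
import random


def denoise(w, noisy, context):
    # Toy stand-in for the shared-weight network (assumption): it blends the
    # previous chunk (context) with the current noisy chunk.
    return w * context + (1.0 - w) * noisy


def sample(w, context, num_steps, rng):
    # Iterative sampler used by BOTH modes; only num_steps differs.
    x = rng.gauss(0.0, 1.0)  # start from noise
    for i in range(num_steps):
        prediction = denoise(w, x, context)
        # Move part of the way toward the prediction; the last step lands on it.
        x = x + (prediction - x) / (num_steps - i)
    return x


def rollout_history(w, first_chunk, num_chunks, few_steps, rng):
    # Few-step mode generates the history autoregressively, as it would at
    # inference time, so training sees model-generated context.
    history = [first_chunk]
    for _ in range(num_chunks):
        history.append(sample(w, history[-1], few_steps, rng))
    return history


def dual_mode_losses(w, real_chunks, few_steps=4, multi_steps=50, seed=0):
    rng = random.Random(seed)
    history = rollout_history(w, real_chunks[0], len(real_chunks) - 2, few_steps, rng)
    context = history[-1]

    # Same weights, same starting noise, different step counts.
    teacher = sample(w, context, multi_steps, random.Random(seed + 1))
    student = sample(w, context, few_steps, random.Random(seed + 1))
    distill_loss = (student - teacher) ** 2  # multi-step output is the target

    # Loss against real data (here: denoising the real next chunk).
    noisy_real = real_chunks[-1] + rng.gauss(0.0, 1.0)
    data_loss = (denoise(w, noisy_real, context) - real_chunks[-1]) ** 2
    return distill_loss, data_loss


if __name__ == "__main__":
    print(dual_mode_losses(0.5, [1.0, 1.1, 1.2, 1.3]))
```

In a real system the multi-step output would be treated as a constant target (no gradient), and both losses would update the one shared set of weights; the toy omits the optimizer entirely and only computes the two loss terms.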
