Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
April 28, 2026
Authors: Yupeng Zhou, Lianghua Huang, Zhifan Wu, Jiabao Wang, Yupeng Shi, Biao Jiang, Daquan Zhou, Yu Liu, Ming-Ming Cheng, Qibin Hou
cs.AI
Abstract
In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than from a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
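The core idea of the abstract, a single weight-shared model whose multi-step sampling mode distills into its own few-step mode, can be illustrated with a deliberately small toy. The sketch below is NOT the paper's implementation: it replaces the audio-video diffusion model with a scalar ODE, the network with two parameters (`theta`, `alpha`), and backpropagation with a finite-difference gradient, and it omits the few-step history-context rollout described in the abstract. All names (`velocity`, `sample`, `theta`, `alpha`) are hypothetical choices for this illustration.

```python
# Toy sketch (assumptions as stated above): dual-mode self-distillation
# with shared parameters, on the scalar ODE dx/dt = v(x).

def velocity(x, dt, theta, alpha):
    # Shared "weights" used by BOTH modes. Conditioning on dt plays the
    # role of the model knowing which sampling schedule it runs under,
    # so the few-step mode has the capacity to match the multi-step mode.
    return (theta + alpha * dt) * x

def sample(x0, steps, theta, alpha):
    # Euler sampler: steps=50 ~ multi-step mode, steps=4 ~ few-step mode.
    x, dt = x0, 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity(x, dt, theta, alpha)
    return x

theta, alpha, lr, eps, x0 = 0.8, 0.0, 0.5, 1e-5, 1.0
gap0 = abs(sample(x0, 4, theta, alpha) - sample(x0, 50, theta, alpha))
for _ in range(500):
    # The multi-step mode of the SAME parameters provides the target;
    # it is treated as a constant (stop-gradient), mimicking
    # self-distillation inside one weight-shared model.
    target = sample(x0, 50, theta, alpha)
    # Finite-difference gradient through the few-step (student) path only.
    lp = (sample(x0, 4, theta, alpha + eps) - target) ** 2
    lm = (sample(x0, 4, theta, alpha - eps) - target) ** 2
    alpha -= lr * (lp - lm) / (2 * eps)
gap1 = abs(sample(x0, 4, theta, alpha) - sample(x0, 50, theta, alpha))
print(f"few-vs-multi gap: before={gap0:.4f}, after={gap1:.6f}")
```

After training, the 4-step output closes in on the 50-step output of the same shared parameters. The real method additionally rolls out the few-step mode during training to produce the history context that later windows condition on (the training-inference consistency effect), and applies both modes to a full autoregressive audio-video model rather than a scalar.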