V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

Abstract

In portrait video generation, producing a video from a single image has become increasingly common. A common approach augments generative models with adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose, depth map) vary in strength. Weaker conditions often struggle to take effect because stronger conditions interfere with them, which makes balancing the conditions difficult. In our work on portrait video generation, we found audio signals to be particularly weak, often overshadowed by stronger signals such as facial pose and the reference image. Training directly with weak signals, however, often leads to convergence difficulties. To address this, we propose V-Express, a simple method that balances different control signals through progressive training and a conditional dropout operation. Our method gradually enables effective control by weak conditions, achieving generation that simultaneously accounts for facial pose, reference image, and audio. Experimental results show that our method can effectively generate portrait videos controlled by audio. It also offers a potential solution for using conditions of varying strengths simultaneously and effectively.
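The idea of conditional dropout under a progressive schedule can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `apply_conditional_dropout`, the use of `None` to mark a dropped signal, the stage names, and every probability below are assumptions made for illustration, since the abstract gives no such details. The sketch only shows the general mechanism of randomly withholding stronger conditions (pose, reference image) during training so that a weaker one (audio) has to carry information.

```python
import random


def apply_conditional_dropout(conditions, drop_probs, rng):
    """Return a copy of `conditions` in which each signal is independently
    replaced by None (treated here as "dropped") with its own probability.

    conditions: dict mapping a signal name to its value (any object).
    drop_probs: dict mapping a signal name to a drop probability in [0, 1];
                signals missing from this dict are never dropped.
    rng:        a random.Random instance, passed in for reproducibility.
    """
    result = {}
    for name, value in conditions.items():
        p = drop_probs.get(name, 0.0)
        # rng.random() is in [0, 1), so p == 0.0 never drops and p == 1.0 always drops.
        result[name] = None if rng.random() < p else value
    return result


# Hypothetical progressive schedule: placeholder stage names and probabilities,
# NOT values taken from the paper. Later stages withhold the strong signals
# more often while the weak audio signal is always kept.
HYPOTHETICAL_STAGE_DROP_PROBS = {
    "stage_1": {"pose": 0.0, "reference": 0.0, "audio": 0.0},
    "stage_2": {"pose": 0.3, "reference": 0.3, "audio": 0.0},
    "stage_3": {"pose": 0.5, "reference": 0.5, "audio": 0.0},
}

if __name__ == "__main__":
    rng = random.Random(0)
    batch_conditions = {"pose": "pose_features", "reference": "ref_features", "audio": "audio_features"}
    for stage, probs in HYPOTHETICAL_STAGE_DROP_PROBS.items():
        print(stage, apply_conditional_dropout(batch_conditions, probs, rng))
```

In a real training loop the dropped entries would typically be replaced by a zero tensor or a learned null embedding before being passed to the model; which of these V-Express uses is not stated in the abstract.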
