V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

Abstract

In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent. A common approach is to leverage generative models and enhance them with adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose, depth map) vary in strength. Weaker conditions often struggle to take effect because of interference from stronger ones, which makes balancing these conditions a challenge. In our work on portrait video generation, we identified audio signals as particularly weak, often overshadowed by stronger signals such as facial pose and reference image. Training directly with weak signals, however, often leads to difficulties in convergence. To address this, we propose V-Express, a simple method that balances different control signals through progressive training and a conditional dropout operation. Our method gradually enables effective control by weak conditions, thereby achieving generation that simultaneously takes into account the facial pose, reference image, and audio. The experimental results demonstrate that our method can effectively generate portrait videos controlled by audio. Furthermore, it offers a potential solution for the simultaneous and effective use of conditions of varying strengths.
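The abstract names conditional dropout but does not describe how it is applied. As an illustration only, the sketch below assumes one plausible reading: during training, each strong condition is independently replaced by zeros with some probability, so the model cannot rely on it alone and has to use the weaker audio signal. The function name `conditional_dropout`, the condition names, and the drop probabilities are hypothetical and are not taken from the paper.

```python
import random


def conditional_dropout(conditions, drop_probs, rng):
    """Return a copy of `conditions` with some entries zeroed out.

    conditions: dict mapping a condition name to a list of floats
        (a stand-in for an embedding; a real model would use tensors).
    drop_probs: dict mapping a condition name to its drop probability.
        Names missing from this dict are never dropped.
    rng: a random.Random instance, passed in for reproducibility.
    """
    out = {}
    for name, values in conditions.items():
        p = drop_probs.get(name, 0.0)
        if rng.random() < p:
            # Dropped: replace the condition with zeros of the same length.
            out[name] = [0.0] * len(values)
        else:
            # Kept: copy so the caller's data is not shared or mutated.
            out[name] = list(values)
    return out


# Hypothetical drop rates: the strong conditions are dropped half the time,
# the weak audio condition is always kept. These values are assumptions.
DROP_PROBS = {"reference_image": 0.5, "facial_pose": 0.5, "audio": 0.0}

if __name__ == "__main__":
    rng = random.Random(0)
    batch = {
        "reference_image": [0.9, 0.8],
        "facial_pose": [0.7, 0.6],
        "audio": [0.1, 0.2],
    }
    print(conditional_dropout(batch, DROP_PROBS, rng))
```

In a progressive schedule, such drop rates could differ from one training stage to the next; the stage boundaries and rates would have to come from the paper itself rather than from this sketch.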
