MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation

Abstract

Interactive digital human video generation has recently attracted widespread attention and achieved remarkable progress. However, building a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with high latency, heavy computational cost, and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings, including audio, pose, and text, and outputs spatially and semantically coherent representations that guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with a reduction ratio of up to 64×, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world modeling highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.
