Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

Abstract

To address the bottleneck of accurately interpreting user intent in current video generation, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple the various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that give backbone video generators better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations show that our system significantly improves the controllability and video quality of existing video generation models across various aspects. Project Page: https://sqwu.top/Any2Cap/
