CogOmniControl: Controllable Video Generation Based on Reasoning-Driven Creative Intent Cognition

Abstract

Recent diffusion models achieve strong photorealism and fluency in video generation, yet they remain fragile under abstract, sparse, or complex conditions, which leads to poor performance in professional production workflows that rely on inputs such as storyboard sketches and clay renders. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone. Both approaches leave a capability gap and fail to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized model, CogVLM, on authentic anime production data. Compared to generic VLMs, it produces clearer and more professional outputs: it accurately recognizes the user's creative intent from sparse and abstract conditions and turns these cues into dense reasoning output. In addition, CogOmniDiT unifies control from various conditions through in-context generation and is aligned to CogVLM's reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust ability to guide video generation, we unlock its potential to plan specific evaluators, which enables Best-of-N selection over the generated videos. This integration turns the entire framework into a closed-loop, "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, two benchmarks built from professional workflow data that carries genuine rather than simulated creative intent. Experiments on the two benchmarks show that CogOmniControl surpasses existing open-source models. Project website: https://um-lab.github.io/CogOmniControl/
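
The Best-of-N selection mentioned above can be illustrated with a generic sketch: score each candidate with a set of evaluators and keep the highest-scoring one. This is not the authors' implementation. The function name `best_of_n`, the use of a plain mean to combine evaluator scores, and the tie-breaking rule are all assumptions made for illustration; the abstract does not specify how the evaluators planned by CogVLM are combined.

```python
from typing import Callable, Sequence, Tuple, TypeVar

V = TypeVar("V")  # placeholder type standing in for a generated video


def best_of_n(
    candidates: Sequence[V],
    evaluators: Sequence[Callable[[V], float]],
) -> Tuple[V, float]:
    """Return the candidate with the highest mean evaluator score.

    Hypothetical helper: averaging the scores is an assumption, and any
    other aggregation could be substituted. Ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if not evaluators:
        raise ValueError("need at least one evaluator")

    def mean_score(candidate: V) -> float:
        return sum(evaluate(candidate) for evaluate in evaluators) / len(evaluators)

    # Pair each score with its index so ties resolve to the earliest candidate.
    scored = [(mean_score(c), i) for i, c in enumerate(candidates)]
    best_score, best_index = max(scored, key=lambda pair: (pair[0], -pair[1]))
    return candidates[best_index], best_score
```

In a real pipeline the candidates would be N videos sampled from the generator for the same conditions, and each evaluator would be a scoring function derived from the reasoning output; here both are left abstract so the selection logic stands on its own.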