CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

May 19, 2026
Authors: Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu, Jianbing Shen
cs.AI

Abstract

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse, or complex conditions, which leads to poor performance in professional production workflows that rely on inputs such as storyboard sketches and clay renders. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM on authentic anime production data. Compared to generic VLMs, it produces clearer and more professional outputs, accurately recognizing the user's creative intent from sparse and abstract conditions and turning these cues into dense reasoning output. In addition, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we unlock its potential for planning specific evaluators and enable Best-of-N selection over the generated videos. This integration turns the entire framework into a closed-loop, "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, two benchmarks built from professional workflow data that carries genuine rather than simulated creative intent. Experiments on both benchmarks show that CogOmniControl surpasses existing open-source models. Project website: https://um-lab.github.io/CogOmniControl/
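The Best-of-N selection mentioned in the abstract can be illustrated with a minimal, generic sketch. The abstract does not describe how CogOmniControl implements this step, so everything below is an assumption for illustration: the function name `best_of_n`, the `generate` and `evaluators` callables, and the choice to average evaluator scores are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def best_of_n(
    generate: Callable[[int], Candidate],
    evaluators: Sequence[Callable[[Candidate], float]],
    n: int,
) -> Tuple[Candidate, float]:
    """Generate n candidates and return the one with the highest mean score.

    `generate` is a hypothetical stand-in for a video generator called with a
    seed; `evaluators` are hypothetical stand-ins for the planned evaluators.
    Averaging the scores is an assumption, not the paper's stated rule.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not evaluators:
        raise ValueError("at least one evaluator is required")

    best_candidate, best_score = None, float("-inf")
    for seed in range(n):
        candidate = generate(seed)
        # Mean of all evaluator scores for this candidate.
        score = sum(evaluate(candidate) for evaluate in evaluators) / len(evaluators)
        # Strict comparison keeps the earliest candidate on ties.
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score
```

In a closed-loop setup of the kind the abstract describes, the generator would be sampled N times for the same conditions and only the top-scoring video would be kept; the sketch leaves the generator and evaluators abstract so that any scoring scheme can be plugged in.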