CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
May 19, 2026
Authors: Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu, Jianbing Shen
cs.AI
Abstract
Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse, or complex conditions, which leads to poor performance in professional production workflows such as those conditioned on storyboard sketches or clay renders. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone. Both approaches leave a capability gap and fail to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into two stages: creative intent cognition and generation. Specifically, we train a specialized CogVLM on authentic anime production data. Compared to generic VLMs, it produces clearer and more professional outputs: it accurately recognizes the user's creative intent from sparse and abstract conditions and turns these cues into dense reasoning output. In addition, CogOmniDiT unifies control over various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we unlock its potential for planning specific evaluators, which enables Best-of-N selection over the generated videos. This integration turns the entire framework into a closed-loop, "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, two benchmarks built from professional workflow data that carries genuine creative intent rather than simulated intent. Experiments on both benchmarks show that CogOmniControl surpasses existing open-source models. Project website: https://um-lab.github.io/CogOmniControl/
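The closed loop described in the abstract (reason about the condition, generate several candidates, keep the one the evaluator rates highest) can be illustrated with a minimal sketch. This is not the authors' implementation: `best_of_n`, `reason`, `generate`, and `score` are hypothetical names, the toy stand-ins below replace CogVLM and CogOmniDiT with trivial functions, and a single scalar score is assumed in place of the planned evaluators.

```python
from typing import Callable, Tuple, TypeVar

Condition = TypeVar("Condition")
Video = TypeVar("Video")


def best_of_n(
    condition: Condition,
    reason: Callable[[Condition], str],      # hypothetical: sparse condition -> dense reasoning text
    generate: Callable[[str, int], Video],   # hypothetical: (reasoning text, seed) -> candidate video
    score: Callable[[Video, str], float],    # hypothetical: evaluator score, higher is better
    n: int = 4,
) -> Tuple[Video, float]:
    """Generate n candidates from one reasoning output and return the best one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    plan = reason(condition)  # the reasoning step runs once and is shared by all candidates
    best_video, best_score = None, float("-inf")
    for seed in range(n):
        video = generate(plan, seed)
        s = score(video, plan)
        if s > best_score:  # strict comparison: ties keep the earliest candidate
            best_video, best_score = video, s
    return best_video, best_score


# Toy stand-ins (assumptions for illustration only; no real model is called).
def toy_reason(sketch: str) -> str:
    return f"dense plan for: {sketch}"


def toy_generate(plan: str, seed: int) -> str:
    return f"video#{seed}"


def toy_score(video: str, plan: str) -> float:
    fixed_scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.9}
    return fixed_scores[int(video.split("#")[1])]


video, s = best_of_n("storyboard sketch", toy_reason, toy_generate, toy_score, n=4)
print(video, s)  # video#1 0.9
```

Passing the same reasoning text to both the generator and the scorer mirrors the idea that the reasoning model guides generation and also informs evaluation; how the actual system plans its evaluators is not specified in the abstract.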