CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Abstract

Recent diffusion models achieve strong photorealism and fluency in video generation, yet they remain fragile under abstract, sparse, or complex conditions. As a result, they perform poorly in professional production workflows that rely on conditions such as storyboard sketches and clay renders. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone, which leaves a capability gap: they fail to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM on authentic anime production data. Compared to generic VLMs, it produces clearer and more professional outputs, accurately recognizing the user's creative intent from sparse and abstract conditions and turning these cues into dense reasoning output. In addition, CogOmniDiT unifies control over various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, building on CogVLM's robust ability to guide video generation, we use it to plan specific evaluators, which enables Best-of-N selection over the generated videos. This integration turns the entire framework into a closed-loop, "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflow data that carries genuine rather than simulated creative intent. Experiments on both benchmarks show that CogOmniControl outperforms existing open-source models. Project website: https://um-lab.github.io/CogOmniControl/
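
The Best-of-N selection mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' implementation: the function `best_of_n`, the `generate` callable, and the `evaluators` list are hypothetical stand-ins, and averaging the evaluator scores is an assumption made only for illustration. The abstract does not say how the evaluators planned by CogVLM are combined.

```python
from typing import Callable, Sequence, Tuple, TypeVar

# Placeholder type for a generated video (hypothetical; any object works here).
Video = TypeVar("Video")


def best_of_n(
    generate: Callable[[int], Video],
    evaluators: Sequence[Callable[[Video], float]],
    n: int,
) -> Tuple[Video, float]:
    """Generate n candidates and return the one with the highest mean score.

    `generate` maps a seed to a candidate video, and each evaluator maps a
    candidate to a score where higher is better. Both are assumptions made
    for this sketch, not part of the described system.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not evaluators:
        raise ValueError("at least one evaluator is required")

    best_video = generate(0)
    best_score = sum(e(best_video) for e in evaluators) / len(evaluators)

    # Score the remaining candidates and keep the best one seen so far.
    for seed in range(1, n):
        video = generate(seed)
        score = sum(e(video) for e in evaluators) / len(evaluators)
        if score > best_score:
            best_video, best_score = video, score

    return best_video, best_score
```

In a closed-loop setting of the kind the abstract describes, `generate` would wrap the video generator and `evaluators` would come from the planning step. Here they are left abstract so that the selection logic stands on its own.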