VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Abstract

Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To combine these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that uses large multimodal models to generate a sparse set of critical keyframes as snapshots, which then guide sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead, and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
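The two-stage flow described in the abstract (a multimodal model proposes a sparse set of keyframes, then the video generator is tuned only on those keyframes before sampling) can be illustrated with a small control-flow sketch. This is not the authors' implementation: every name here (`Keyframe`, `vchain_sketch`, `reasoner`, `tune_step`, `sample_video`, and the toy stand-ins) is hypothetical, and the toy functions only mimic the data flow without any real model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Keyframe:
    """One sparse snapshot of the scene (hypothetical container)."""
    caption: str  # text describing the visual state at this moment
    image: Any    # placeholder for the snapshot image


def vchain_sketch(
    prompt: str,
    reasoner: Callable[[str], List[Keyframe]],
    tune_step: Callable[[Any, Keyframe], Any],
    sample_video: Callable[[Any, str], Any],
    generator_state: Any,
) -> Tuple[Any, List[Keyframe]]:
    # Stage 1: a multimodal reasoner proposes a sparse set of keyframes.
    keyframes = reasoner(prompt)
    # Stage 2: tune the pre-trained generator only at these key moments
    # (no dense per-frame supervision is used in this sketch).
    for kf in keyframes:
        generator_state = tune_step(generator_state, kf)
    # Stage 3: sample the final video from the tuned generator.
    return sample_video(generator_state, prompt), keyframes


# Toy stand-ins (assumptions for illustration only, no real models involved).
def toy_reasoner(prompt: str) -> List[Keyframe]:
    captions = ["ball is held", "ball is falling", "ball hits the floor"]
    return [Keyframe(caption=c, image=f"<image: {c}>") for c in captions]


def toy_tune_step(state: List[str], kf: Keyframe) -> List[str]:
    # Records which keyframes were "tuned on"; returns a new list.
    return state + [kf.caption]


def toy_sample_video(state: List[str], prompt: str) -> str:
    return f"video({prompt!r}, tuned_on={len(state)} keyframes)"


if __name__ == "__main__":
    video, kfs = vchain_sketch(
        "a ball is dropped", toy_reasoner, toy_tune_step, toy_sample_video, []
    )
    print(video)
    for kf in kfs:
        print("-", kf.caption)
```

In a real system the three callables would wrap a large multimodal model, a parameter-efficient fine-tuning step, and the video generator's sampler; those components are left abstract here because the abstract does not specify them.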
