Bernini: Latent Semantic Planning for Video Diffusion

Abstract

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.
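The division of labor described above can be illustrated with a minimal sketch of the planner-to-renderer interface. This is not Bernini's implementation: the names `RenderConditions`, `plan_semantics`, `build_conditions`, and `conditioning_streams` are hypothetical, and the planner is a toy stand-in (it averages the prompt vectors) where the actual system would use a trained MLLM. The sketch only shows which conditioning streams reach the renderer in generation versus editing.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class RenderConditions:
    """Inputs the renderer is conditioned on (illustrative names, not from the paper)."""
    semantic_plan: List[Vector]   # planner output: target tokens in the ViT embedding space
    text_features: List[Vector]   # features of the text prompt
    source_vae_features: Optional[List[Vector]] = None  # supplied only for editing


def plan_semantics(prompt_tokens: List[Vector], num_target_tokens: int) -> List[Vector]:
    """Toy stand-in for the MLLM planner (assumption: a real planner is a trained model).

    It emits `num_target_tokens` copies of the mean prompt vector, only to
    show that the plan is a sequence of vectors in one embedding space.
    """
    dim = len(prompt_tokens[0])
    mean = [sum(tok[i] for tok in prompt_tokens) / len(prompt_tokens) for i in range(dim)]
    return [list(mean) for _ in range(num_target_tokens)]


def build_conditions(
    prompt_tokens: List[Vector],
    num_target_tokens: int,
    source_vae_features: Optional[List[Vector]] = None,
) -> RenderConditions:
    """Run the planner, then bundle everything the renderer would consume."""
    plan = plan_semantics(prompt_tokens, num_target_tokens)
    return RenderConditions(plan, prompt_tokens, source_vae_features)


def conditioning_streams(cond: RenderConditions) -> List[str]:
    """List the streams that are active for this request."""
    names = ["semantic_plan", "text_features"]
    if cond.source_vae_features is not None:
        names.append("source_vae_features")
    return names


# Generation: the renderer sees only the plan and the text features.
generation = build_conditions([[1.0, 3.0], [3.0, 5.0]], num_target_tokens=4)

# Editing: source VAE features are added for detail preservation.
editing = build_conditions([[1.0, 3.0]], num_target_tokens=2, source_vae_features=[[0.5]])
```

Because the only thing passed between the two stages is the semantic plan, either stage could in principle be swapped or retrained without changing the other's inputs, which is the property the abstract relies on when it says the planner and renderer can be trained separately.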