UniVid: The Open-Source Unified Video Model

Abstract

Unified video modeling, which combines generation and understanding in a single model, is increasingly important but faces two key challenges. First, semantic faithfulness is hard to maintain during flow-based generation, both because of the imbalance between text and visual tokens and because cross-modal attention is applied uniformly across the flow trajectory. Second, image-centric multimodal large language models (MLLMs) are difficult to extend to video efficiently without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence, and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance: a 2.2% improvement in VBench-Long total score over EasyAnimateV5.1, and accuracy gains of 1.0% on MSVD-QA and 3.3% on ActivityNet-QA over the best prior 7B baselines.
