UniVid: The Open-Source Unified Video Model

September 29, 2025
Authors: Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang
cs.AI

Abstract

Unified video modeling that combines generation and understanding is increasingly important, but it faces two key challenges: (1) maintaining semantic faithfulness during flow-based generation, owing to text-visual token imbalance and the limitation of uniform cross-modal attention along the flow trajectory; and (2) efficiently extending image-centric multimodal large language models (MLLMs) to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance: a 2.2% improvement in VBench-Long total score over EasyAnimateV5.1, and accuracy gains of 1.0% on MSVD-QA and 3.3% on ActivityNet-QA over the best prior 7B baselines.
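
The abstract only names the coupling, not its implementation. As a rough illustration, a "lightweight adapter" between an MLLM and a diffusion decoder is commonly a small projection module that maps the MLLM's hidden states into the decoder's conditioning space. The PyTorch sketch below shows that generic pattern; the class name, layer choices, and dimensions are assumptions for illustration, not UniVid's actual design.

```python
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    """Hypothetical adapter mapping MLLM hidden states to the
    conditioning space of a diffusion decoder. The architecture and
    dimensions are illustrative, not taken from the paper."""

    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(mllm_dim),
            nn.Linear(mllm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, mllm_dim) token states from the MLLM
        return self.proj(mllm_hidden)  # (batch, seq_len, cond_dim)

# Usage: produce conditioning tokens for a (stand-in) diffusion decoder.
adapter = LightweightAdapter()
hidden = torch.randn(2, 77, 4096)   # placeholder MLLM outputs
cond = adapter(hidden)              # conditioning tokens for the decoder
print(cond.shape)                   # torch.Size([2, 77, 1024])
```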
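
Likewise, Pyramid Reflection's "dynamic keyframe selection" is only named here. One common coarse-to-fine realization samples frames at progressively finer strides and keeps only the top-scoring candidates at each level; the sketch below shows that generic pattern under assumed strides, budgets, and scoring function, and should not be read as the paper's method.

```python
from typing import Callable, List, Sequence

def pyramid_keyframes(
    frames: Sequence,
    score: Callable[[object], float],
    levels: int = 3,
    keep_per_level: int = 4,
) -> List[int]:
    """Coarse-to-fine keyframe selection (illustrative only): at each
    level, subsample the candidate indices at a coarser stride, keep the
    top-scoring frames, then expand their neighborhoods for the next level."""
    candidates = list(range(len(frames)))
    for level in range(levels):
        stride = max(1, 2 ** (levels - level - 1))
        sampled = candidates[::stride]
        # Keep the highest-scoring sampled indices at this level.
        sampled.sort(key=lambda i: score(frames[i]), reverse=True)
        kept = sorted(sampled[:keep_per_level])
        # Expand each kept index into a small neighborhood of candidates.
        candidates = sorted({
            j for i in kept
            for j in range(max(0, i - stride), min(len(frames), i + stride + 1))
        })
    # Final pick: top frames among the surviving candidates.
    candidates.sort(key=lambda i: score(frames[i]), reverse=True)
    return sorted(candidates[:keep_per_level])

# Toy usage with a fake relevance score peaking at frame 40.
frames = list(range(64))  # stand-in for decoded video frames
keyframes = pyramid_keyframes(frames, score=lambda f: -abs(f - 40))
print(keyframes)          # indices clustered near frame 40
```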