

UniVid: The Open-Source Unified Video Model

September 29, 2025
作者: Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang
cs.AI

Abstract

Unified video modeling that combines generation and understanding is increasingly important, but it faces two key challenges: (1) maintaining semantic faithfulness during flow-based generation, given the text-visual token imbalance and the limitations of uniform cross-modal attention along the flow trajectory, and (2) efficiently extending image-centric multimodal large language models (MLLMs) to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance: a 2.2% improvement in VBench-Long total score over EasyAnimateV5.1, and accuracy gains of 1.0% on MSVD-QA and 3.3% on ActivityNet-QA over the best prior 7B baselines.
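To make the "lightweight adapter" coupling concrete, here is a minimal sketch of one common way such a bridge is built: a bottleneck MLP that projects MLLM hidden states into the conditioning space of a diffusion decoder. The abstract does not specify the adapter's internals, so the architecture, dimensions, and names below are illustrative assumptions, not UniVid's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def lightweight_adapter(mllm_hidden, w_down, w_up):
    """Hypothetical bottleneck adapter: maps MLLM hidden states of shape
    (seq_len, d_mllm) to diffusion-decoder conditioning of shape
    (seq_len, d_dec). No residual connection, since the two dims differ."""
    h = mllm_hidden @ w_down       # down-project to a small bottleneck
    h = np.maximum(h, 0.0)         # ReLU non-linearity
    return h @ w_up                # up-project to the decoder's dim

# Illustrative sizes: a 4096-dim MLLM, 128-dim bottleneck, 1024-dim decoder.
d_mllm, d_bottleneck, d_dec, seq = 4096, 128, 1024, 77
w_down = rng.normal(0.0, 0.02, (d_mllm, d_bottleneck))
w_up = rng.normal(0.0, 0.02, (d_bottleneck, d_dec))
cond = lightweight_adapter(rng.normal(size=(seq, d_mllm)), w_down, w_up)
# cond has shape (77, 1024), ready to condition the decoder
```

The appeal of such an adapter is that only the two small projection matrices are trained, leaving both the MLLM and the diffusion decoder frozen, which matches the abstract's goal of avoiding costly retraining.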
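The "dynamic keyframe selection" behind Pyramid Reflection can be pictured as a coarse-to-fine search over frames: sample sparsely, keep the highest-scoring frames, then refine around them at smaller temporal strides. The sketch below illustrates that general idea only; the function name, scoring, and refinement rule are assumptions, not the paper's algorithm.

```python
def pyramid_keyframe_selection(frame_scores, budget, levels=3):
    """Coarse-to-fine keyframe selection (illustrative sketch).

    frame_scores: per-frame relevance scores, e.g. text-frame similarity.
    budget: number of keyframes to return.
    At each level the temporal stride halves, so refinement zooms in on
    the neighborhoods of the best coarse frames.
    """
    n = len(frame_scores)
    selected = set()
    stride = max(n // (2 ** levels), 1)
    candidates = list(range(0, n, stride))   # level 0: uniform sampling
    for _ in range(levels):
        # Keep the highest-scoring candidates within budget.
        candidates.sort(key=lambda i: frame_scores[i], reverse=True)
        keep = candidates[:budget]
        selected.update(keep)
        # Refine: probe the temporal neighbors of kept frames.
        stride = max(stride // 2, 1)
        candidates = sorted({min(max(i + d, 0), n - 1)
                             for i in keep for d in (-stride, 0, stride)})
    # Final answer: top-`budget` frames by score, in temporal order.
    top = sorted(selected, key=lambda i: frame_scores[i], reverse=True)[:budget]
    return sorted(top)

scores = [0.0] * 32
scores[8] = 1.0                      # one clearly relevant frame
keyframes = pyramid_keyframe_selection(scores, budget=4)
```

Compared with scoring every frame, this kind of pyramid search touches only a logarithmic number of temporal neighborhoods, which is where the efficiency claim for temporal reasoning would come from.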