UniVid: The Open-Source Unified Video Model

Abstract

Unified video modeling that combines generation and understanding is increasingly important, but it faces two key challenges. The first is maintaining semantic faithfulness during flow-based generation, which is hampered by the imbalance between text and visual tokens and by the limitations of applying uniform cross-modal attention across the flow trajectory. The second is efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence, and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance: UniVid achieves a 2.2% improvement in VBench-Long total score over EasyAnimateV5.1, and accuracy gains of 1.0% and 3.3% on MSVD-QA and ActivityNet-QA, respectively, over the best prior 7B baselines.

