UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
September 26, 2025
Authors: Lan Chen, Yuchao Gu, Qi Mao
cs.AI
Abstract
Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent work such as the Large Vision Model (LVM) extends this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context that guides the output. However, such modeling requires task-specific pre-training across modalities and data sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference, with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; (2) cross-source tasks, from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can be switched simply by reversing the order of the visual sentence in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
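
To make the visual-sentence formulation concrete, the following is a minimal sketch of how in-context example pairs and a query frame could be packed into one frame sequence for a video model, and how flipping each pair reverses the task direction (generation vs. understanding). The helper name `build_visual_sentence` and the tensor layout are illustrative assumptions; the abstract does not specify UniVid's actual interfaces.

```python
# A minimal sketch of the "visual sentence" idea; names and layout are
# assumptions for illustration, not UniVid's actual API.
import torch

def build_visual_sentence(context_pairs, query, reverse=False):
    """Concatenate (input, output) example frames plus a query frame
    along the temporal axis to form one visual sentence.

    context_pairs: list of (input_frame, output_frame) tensors, each (C, H, W)
    query: frame whose counterpart the model should produce, (C, H, W)
    reverse: flipping each pair switches the task direction, e.g. from
             generation (condition -> image) to understanding (image -> condition)
    """
    frames = []
    for src, tgt in context_pairs:
        pair = (tgt, src) if reverse else (src, tgt)
        frames.extend(pair)
    frames.append(query)
    return torch.stack(frames, dim=0)  # (T, C, H, W): a pseudo "video"

# Toy example: two in-context pairs plus one query, 3x64x64 RGB frames.
C, H, W = 3, 64, 64
pairs = [(torch.randn(C, H, W), torch.randn(C, H, W)) for _ in range(2)]
query = torch.randn(C, H, W)

sentence = build_visual_sentence(pairs, query)                      # generation direction
rev_sentence = build_visual_sentence(pairs, query, reverse=True)    # understanding direction
print(sentence.shape)  # torch.Size([5, 3, 64, 64])
```

Because the same sequence-completion objective is used throughout, switching tasks requires no architectural change, only a different arrangement of the context frames.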
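
Since the context sequence steers what the fine-tuned video diffusion transformer produces, one plausible way to condition on it is inpainting-style sampling: keep the clean context frames fixed at every step and denoise only the query frame. The denoiser signature, the Euler-style update, and the step count below are assumptions for illustration, not the paper's actual sampling procedure.

```python
# A hedged sketch of context-conditioned denoising; schedule and masking
# are illustrative, as the abstract does not describe the sampling details.
import torch

@torch.no_grad()
def sample_last_frame(denoiser, sentence, steps=50):
    """Denoise only the final frame of a visual sentence, holding the
    clean context frames fixed (inpainting-style conditioning)."""
    x = sentence.clone()
    x[-1] = torch.randn_like(x[-1])  # start the query frame from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x.unsqueeze(0), t).squeeze(0)  # per-frame noise prediction
        x[-1] = x[-1] - eps[-1] / steps  # crude Euler-style update on the query only
        # context frames x[:-1] stay untouched, so they steer the output
    return x[-1]

# Dummy denoiser standing in for a fine-tuned video diffusion transformer.
dummy = lambda video, t: torch.zeros_like(video)
out = sample_last_frame(dummy, torch.randn(5, 3, 64, 64))
print(out.shape)  # torch.Size([3, 64, 64])
```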