UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
September 26, 2025
Authors: Lan Chen, Yuchao Gu, Qi Mao
cs.AI
Abstract
Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works such as the Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context that guides outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can be switched simply by reversing the order of the visual sentence in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
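
For illustration, the sketch below shows one way a task could be encoded as a visual sentence for in-context conditioning of a video diffusion transformer: (input, output) example pairs are concatenated along the temporal axis, followed by the query, and the model is asked to continue the sequence. This is a minimal assumption-laden sketch, not the authors' implementation; the function name build_visual_sentence and the tensor layout are hypothetical.

```python
# Minimal sketch (assumptions, not the UniVid codebase) of building a
# "visual sentence": context example pairs followed by a query, all stacked
# along the frame (temporal) dimension of a video sequence.
import torch

def build_visual_sentence(context_pairs, query, generation=True):
    """Return one frame sequence [sum(T_i), C, H, W] for in-context inference.

    context_pairs: list of (cond_frames, target_frames) tensors, each [T, C, H, W]
    query:         conditioning frames for the new example, [T, C, H, W]
    generation:    if False, reverse each pair so the model is asked to predict
                   the condition from the target, i.e. an understanding task.
    """
    segments = []
    for cond, target in context_pairs:
        pair = (cond, target) if generation else (target, cond)
        segments.extend(pair)
    segments.append(query)
    # The video model would then be asked to continue this sequence with the
    # missing output frames for the query.
    return torch.cat(segments, dim=0)
```

In this formulation, switching from generation to understanding only changes the order in which each context pair is laid out, which mirrors the abstract's claim that the two task families can be swapped by reversing the visual sentence.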