UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models

Abstract

Large language models, trained on extensive corpora, unify diverse linguistic tasks within a single generative framework. Inspired by this, recent work such as the Large Vision Model (LVM) extends the paradigm to vision by organizing tasks into sequential visual sentences, in which visual prompts serve as the context that guides the output. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this question, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, which extends beyond LVM's uni-modal setting; and (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can be switched in this paradigm simply by reversing the order of the visual sentence. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
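
To make the visual-sentence idea concrete, the following is a minimal sketch of how a context sequence could be assembled and how its order could be reversed to switch between understanding and generation. It is not the authors' implementation: the names `Clip`, `build_visual_sentence`, and `reverse_pairs` are hypothetical, the clips are placeholders rather than real frames or latents, and swapping source and target within each pair is only one assumed reading of "reversing the order".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Clip:
    """Hypothetical placeholder for one visual element.

    An image is modeled as a clip with a single frame, a video as a clip
    with several frames. Real inputs would be pixel tensors or latents.
    """
    name: str
    num_frames: int


# One (source, target) demonstration pair inside the context.
Pair = Tuple[Clip, Clip]


def build_visual_sentence(context_pairs: List[Pair], query: Clip) -> List[Clip]:
    """Flatten the demonstration pairs and append the query.

    The model would be asked to continue this sequence, so the pairs
    implicitly specify the task and the modality of the expected output.
    """
    sentence: List[Clip] = []
    for source, target in context_pairs:
        sentence.extend([source, target])
    sentence.append(query)
    return sentence


def reverse_pairs(context_pairs: List[Pair]) -> List[Pair]:
    """Swap source and target in every pair (assumed form of reversal)."""
    return [(target, source) for source, target in context_pairs]


# Understanding direction: natural video -> annotation (placeholder names).
understanding_context = [
    (Clip("natural_1", 16), Clip("annotation_1", 16)),
    (Clip("natural_2", 16), Clip("annotation_2", 16)),
]
understanding_sentence = build_visual_sentence(
    understanding_context, query=Clip("natural_query", 16)
)

# Generation direction: annotation -> natural video, same data reversed.
generation_sentence = build_visual_sentence(
    reverse_pairs(understanding_context), query=Clip("annotation_query", 16)
)

print([clip.name for clip in understanding_sentence])
print([clip.name for clip in generation_sentence])
```

Under these assumptions, the same demonstration data serves both directions; only the ordering of each pair and the choice of query change, which mirrors the claim that no task-specific modification is needed.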
