EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Abstract

Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.
