EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
September 24, 2025
Authors: Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, Daniil Pakhomov, Zhe Lin, Soo Ye Kim, Qiang Xu
cs.AI
Abstract
Recent advances in foundation models highlight a clear trend toward
unification and scaling, showing emergent capabilities across diverse domains.
While image generation and editing have rapidly transitioned from task-specific
to unified frameworks, video generation and editing remain fragmented due to
architectural limitations and data scarcity. In this work, we introduce
EditVerse, a unified framework for image and video generation and editing
within a single model. By representing all modalities, i.e., text, image, and
video, as a unified token sequence, EditVerse leverages self-attention to
achieve robust in-context learning, natural cross-modal knowledge transfer, and
flexible handling of inputs and outputs with arbitrary resolutions and
durations. To address the lack of video editing training data, we design a
scalable data pipeline that curates 232K video editing samples and combines
them with large-scale image and video datasets for joint training. Furthermore,
we present EditVerseBench, the first benchmark for instruction-based video
editing covering diverse tasks and resolutions. Extensive experiments and user
studies demonstrate that EditVerse achieves state-of-the-art performance,
surpassing existing open-source and commercial models, while exhibiting
emergent editing and generation abilities across modalities.
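
The abstract's central design choice, casting the text instruction, image patches, and video patches as a single token sequence and letting full self-attention mediate in-context editing across arbitrary resolutions and durations, can be illustrated with a minimal sketch. The module names, dimensions, and patch counts below are illustrative assumptions, not the authors' implementation.

# Minimal sketch (assumed, not the authors' code): all modalities are embedded
# to tokens, concatenated into one sequence, and processed with self-attention.
import torch
import torch.nn as nn


class UnifiedSequenceBlock(nn.Module):
    """One transformer block over the joint text/image/video token sequence."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Full self-attention: every token (text, image patch, or video patch)
        # can attend to every other token, which is what enables in-context
        # transfer between the instruction, the source, and the target.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


def build_unified_sequence(text_tok, image_tok, video_tok):
    """Concatenate per-modality embeddings into one in-context sequence.

    text_tok:  (B, L_t, D) embedded instruction tokens
    image_tok: (B, L_i, D) patchified image tokens (variable L_i with resolution)
    video_tok: (B, L_v, D) patchified video tokens (variable L_v with duration)
    """
    return torch.cat([text_tok, image_tok, video_tok], dim=1)


if __name__ == "__main__":
    B, D = 1, 1024
    text = torch.randn(B, 32, D)     # e.g. "replace the sky with a sunset"
    image = torch.randn(B, 256, D)   # 16x16 patches of a reference image
    video = torch.randn(B, 1024, D)  # 4 frames x 256 patches of the source video

    seq = build_unified_sequence(text, image, video)  # (B, 1312, D)
    out = UnifiedSequenceBlock(dim=D)(seq)
    print(out.shape)  # torch.Size([1, 1312, 1024])

Because the sequence length is simply the sum of the per-modality token counts, inputs and outputs of arbitrary resolution and duration fit the same interface, which is consistent with the flexibility the abstract claims.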