EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
September 24, 2025
Authors: Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, Daniil Pakhomov, Zhe Lin, Soo Ye Kim, Qiang Xu
cs.AI
Abstract
Recent advances in foundation models highlight a clear trend toward
unification and scaling, showing emergent capabilities across diverse domains.
While image generation and editing have rapidly transitioned from task-specific
to unified frameworks, video generation and editing remain fragmented due to
architectural limitations and data scarcity. In this work, we introduce
EditVerse, a unified framework for image and video generation and editing
within a single model. By representing all modalities, i.e., text, image, and
video, as a unified token sequence, EditVerse leverages self-attention to
achieve robust in-context learning, natural cross-modal knowledge transfer, and
flexible handling of inputs and outputs with arbitrary resolutions and
durations. To address the lack of video editing training data, we design a
scalable data pipeline that curates 232K video editing samples and combines
them with large-scale image and video datasets for joint training. Furthermore,
we present EditVerseBench, the first benchmark for instruction-based video
editing covering diverse tasks and resolutions. Extensive experiments and user
studies demonstrate that EditVerse achieves state-of-the-art performance,
surpassing existing open-source and commercial models, while exhibiting
emergent editing and generation abilities across modalities.
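
The abstract's central design choice, casting the text instruction, image patches, and video patches as a single token sequence and letting full self-attention mediate in-context editing across arbitrary resolutions and durations, can be illustrated with a minimal sketch. The module names, dimensions, and patch counts below are illustrative assumptions, not the authors' implementation.

# Minimal sketch (assumed, not the authors' code): all modalities are embedded
# to tokens, concatenated into one sequence, and processed with self-attention.
import torch
import torch.nn as nn


class UnifiedSequenceBlock(nn.Module):
    """One transformer block over the joint text/image/video token sequence."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Full self-attention: every token (text, image patch, or video patch)
        # can attend to every other token, which is what enables in-context
        # transfer between the instruction, the source, and the target.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


def build_unified_sequence(text_tok, image_tok, video_tok):
    """Concatenate per-modality embeddings into one in-context sequence.

    text_tok:  (B, L_t, D) embedded instruction tokens
    image_tok: (B, L_i, D) patchified image tokens (variable L_i with resolution)
    video_tok: (B, L_v, D) patchified video tokens (variable L_v with duration)
    """
    return torch.cat([text_tok, image_tok, video_tok], dim=1)


if __name__ == "__main__":
    B, D = 1, 1024
    text = torch.randn(B, 32, D)     # e.g. "replace the sky with a sunset"
    image = torch.randn(B, 256, D)   # 16x16 patches of a reference image
    video = torch.randn(B, 1024, D)  # 4 frames x 256 patches of the source video

    seq = build_unified_sequence(text, image, video)  # (B, 1312, D)
    out = UnifiedSequenceBlock(dim=D)(seq)
    print(out.shape)  # torch.Size([1, 1312, 1024])

Because the sequence length is simply the sum of the per-modality token counts, inputs and outputs of arbitrary resolution and duration fit the same interface, which is consistent with the flexibility the abstract claims.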