EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
September 24, 2025
Authors: Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, Daniil Pakhomov, Zhe Lin, Soo Ye Kim, Qiang Xu
cs.AI
Abstract
Recent advances in foundation models highlight a clear trend toward
unification and scaling, showing emergent capabilities across diverse domains.
While image generation and editing have rapidly transitioned from task-specific
to unified frameworks, video generation and editing remain fragmented due to
architectural limitations and data scarcity. In this work, we introduce
EditVerse, a unified framework for image and video generation and editing
within a single model. By representing all modalities, i.e., text, image, and
video, as a unified token sequence, EditVerse leverages self-attention to
achieve robust in-context learning, natural cross-modal knowledge transfer, and
flexible handling of inputs and outputs with arbitrary resolutions and
durations. To address the lack of video editing training data, we design a
scalable data pipeline that curates 232K video editing samples and combines
them with large-scale image and video datasets for joint training. Furthermore,
we present EditVerseBench, the first benchmark for instruction-based video
editing covering diverse tasks and resolutions. Extensive experiments and user
studies demonstrate that EditVerse achieves state-of-the-art performance,
surpassing existing open-source and commercial models, while exhibiting
emergent editing and generation abilities across modalities.
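
The abstract's core architectural claim is that text, image, and video are all embedded into one token sequence over which self-attention operates, so inputs of arbitrary resolution and duration simply contribute more or fewer tokens. The sketch below illustrates that idea only; it is not the authors' implementation, and all module names, dimensions, and token counts (UnifiedSequenceBlock, build_unified_sequence, D = 512, the patch/frame sizes) are hypothetical.

```python
# Minimal sketch of a unified multimodal token sequence processed by
# self-attention. Assumes each modality has already been embedded to a
# shared width D; the concrete tokenizers are out of scope here.
import torch
import torch.nn as nn

D = 512  # shared embedding width (assumed)


class UnifiedSequenceBlock(nn.Module):
    """One self-attention block over the concatenated multimodal sequence."""

    def __init__(self, dim: int = D, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -- text, image, and video tokens mixed
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + attended)


def build_unified_sequence(text_tok, image_tok, video_tok):
    """Concatenate per-modality token embeddings along the sequence axis.

    Because the sequence length is unconstrained, sources of any resolution
    or duration just add more (or fewer) tokens to the same sequence.
    """
    return torch.cat([text_tok, image_tok, video_tok], dim=1)


if __name__ == "__main__":
    B = 1
    text_tok = torch.randn(B, 77, D)            # e.g. an instruction prompt
    image_tok = torch.randn(B, 32 * 32, D)      # patch tokens of a source image
    video_tok = torch.randn(B, 8 * 16 * 16, D)  # spatio-temporal tokens of a clip

    seq = build_unified_sequence(text_tok, image_tok, video_tok)
    out = UnifiedSequenceBlock()(seq)
    print(out.shape)  # torch.Size([1, 3149, 512])
```

Under this reading, the "in-context learning" described in the abstract falls out of the attention pattern itself: editing instructions, source frames, and target frames all attend to one another within a single sequence, so no task-specific conditioning branch is required.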