EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
September 24, 2025
Authors: Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, Daniil Pakhomov, Zhe Lin, Soo Ye Kim, Qiang Xu
cs.AI
Abstract
Recent advances in foundation models highlight a clear trend toward
unification and scaling, showing emergent capabilities across diverse domains.
While image generation and editing have rapidly transitioned from task-specific
to unified frameworks, video generation and editing remain fragmented due to
architectural limitations and data scarcity. In this work, we introduce
EditVerse, a unified framework for image and video generation and editing
within a single model. By representing all modalities, i.e., text, image, and
video, as a unified token sequence, EditVerse leverages self-attention to
achieve robust in-context learning, natural cross-modal knowledge transfer, and
flexible handling of inputs and outputs with arbitrary resolutions and
durations. To address the lack of video editing training data, we design a
scalable data pipeline that curates 232K video editing samples and combines
them with large-scale image and video datasets for joint training. Furthermore,
we present EditVerseBench, the first benchmark for instruction-based video
editing covering diverse tasks and resolutions. Extensive experiments and user
studies demonstrate that EditVerse achieves state-of-the-art performance,
surpassing existing open-source and commercial models, while exhibiting
emergent editing and generation abilities across modalities.
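
The abstract's core architectural claim is that text, image, and video are all embedded into one token sequence over which self-attention operates, so inputs of arbitrary resolution and duration simply contribute more or fewer tokens. The sketch below illustrates that idea only; it is not the authors' implementation, and all module names, dimensions, and token counts (UnifiedSequenceBlock, build_unified_sequence, D = 512, the patch/frame sizes) are hypothetical.

```python
# Minimal sketch of a unified multimodal token sequence processed by
# self-attention. Assumes each modality has already been embedded to a
# shared width D; the concrete tokenizers are out of scope here.
import torch
import torch.nn as nn

D = 512  # shared embedding width (assumed)


class UnifiedSequenceBlock(nn.Module):
    """One self-attention block over the concatenated multimodal sequence."""

    def __init__(self, dim: int = D, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -- text, image, and video tokens mixed
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + attended)


def build_unified_sequence(text_tok, image_tok, video_tok):
    """Concatenate per-modality token embeddings along the sequence axis.

    Because the sequence length is unconstrained, sources of any resolution
    or duration just add more (or fewer) tokens to the same sequence.
    """
    return torch.cat([text_tok, image_tok, video_tok], dim=1)


if __name__ == "__main__":
    B = 1
    text_tok = torch.randn(B, 77, D)            # e.g. an instruction prompt
    image_tok = torch.randn(B, 32 * 32, D)      # patch tokens of a source image
    video_tok = torch.randn(B, 8 * 16 * 16, D)  # spatio-temporal tokens of a clip

    seq = build_unified_sequence(text_tok, image_tok, video_tok)
    out = UnifiedSequenceBlock()(seq)
    print(out.shape)  # torch.Size([1, 3149, 512])
```

Under this reading, the "in-context learning" described in the abstract falls out of the attention pattern itself: editing instructions, source frames, and target frames all attend to one another within a single sequence, so no task-specific conditioning branch is required.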