EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Abstract

Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.
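As an illustration of the "unified token sequence" idea, the following is a minimal sketch and not the EditVerse implementation. It shows how text, image, and video inputs of different resolutions and durations can each contribute a variable number of tokens to one flat sequence that a self-attention model could then process jointly. The whitespace tokenizer, the 16-pixel patch size, and the names `Segment`, `text_segment`, `visual_segment`, and `build_sequence` are all hypothetical and chosen only for this example; the abstract does not specify how tokens are actually produced.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    modality: str    # "text", "image", or "video"
    num_tokens: int  # number of tokens this segment contributes


def text_segment(prompt: str) -> Segment:
    # Assumption: a toy tokenizer that emits one token per whitespace-separated word.
    return Segment("text", len(prompt.split()))


def visual_segment(frames: int, height: int, width: int, patch: int = 16) -> Segment:
    # Assumption: one token per (patch x patch) region per frame.
    # A single image is treated as a one-frame video.
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")
    modality = "image" if frames == 1 else "video"
    return Segment(modality, frames * (height // patch) * (width // patch))


def build_sequence(segments: List[Segment]) -> List[Tuple[str, int]]:
    # Concatenate all segments into one flat list of (modality, local index) pairs.
    return [(s.modality, i) for s in segments for i in range(s.num_tokens)]


segments = [
    text_segment("make the car red"),               # editing instruction
    visual_segment(frames=8, height=64, width=96),  # source video
    visual_segment(frames=1, height=32, width=32),  # reference image
]
sequence = build_sequence(segments)

# Under full self-attention every token attends to every other token,
# so the attention matrix has len(sequence) ** 2 entries.
attention_entries = len(sequence) ** 2
```

In this sketch, changing the resolution or the number of frames only changes the sequence length, which is one way to read the abstract's claim about flexible handling of arbitrary resolutions and durations. The quadratic growth of `attention_entries` also hints at why sequence length matters for cost.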