InstructX: Towards Unified Visual Editing with MLLM Guidance

Abstract

With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.