InstructX: Towards Unified Visual Editing with MLLM Guidance
October 9, 2025
Authors: Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, Qian He
cs.AI
Abstract
With recent advances in Multimodal Large Language Models (MLLMs) demonstrating strong visual understanding and reasoning, there is growing interest in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices, and the integration of MLLMs with diffusion models remains an open challenge in difficult tasks such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze how images and videos cooperate and differ under unified modeling: (1) we show that training on image data alone can yield emergent video editing capabilities without explicit supervision, alleviating the constraints imposed by scarce video training data; (2) by incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method handles a broad range of image and video editing tasks and achieves state-of-the-art performance.
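
To make point (2) concrete, the following is a minimal, hypothetical sketch of how modality-specific MLLM features could condition a shared diffusion editor: the MLLM's hidden states are routed through a per-modality projection before serving as conditioning tokens. All names here (ModalityAdapter, the dimensions, the projection design) are illustrative assumptions for exposition, not the paper's actual implementation.

# Hypothetical sketch: per-modality projection of MLLM features into a
# diffusion model's conditioning space. Not the InstructX implementation.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Routes MLLM hidden states through a modality-specific projection
    before they condition the diffusion model (e.g., via cross-attention)."""
    def __init__(self, mllm_dim: int, cond_dim: int):
        super().__init__()
        # One projection head per modality, sharing the rest of the model.
        self.proj = nn.ModuleDict({
            "image": nn.Linear(mllm_dim, cond_dim),
            "video": nn.Linear(mllm_dim, cond_dim),
        })

    def forward(self, mllm_feats: torch.Tensor, modality: str) -> torch.Tensor:
        # mllm_feats: (batch, tokens, mllm_dim) hidden states from the MLLM.
        return self.proj[modality](mllm_feats)

# Usage sketch: encode the instruction plus source visuals with an MLLM,
# project the features, and pass them to the diffusion denoiser as conditioning.
adapter = ModalityAdapter(mllm_dim=4096, cond_dim=1024)
feats = torch.randn(1, 77, 4096)         # stand-in for MLLM hidden states
cond = adapter(feats, modality="video")  # (1, 77, 1024) conditioning tokens

Under this reading, image and video tasks share one diffusion backbone while only the lightweight modality-specific heads differ, which is one plausible way a single model could serve both editing settings.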