
Exploring MLLM-Diffusion Information Transfer with MetaCanvas

December 12, 2025
作者: Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang, Weifeng Chen, Zecheng He, Felix Juefei-Xu, Junzhe Sun, Zhipeng Fan, Ali Thabet, Mohit Bansal, Chu Wang
cs.AI

Abstract

Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.
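The abstract gives no implementation details, but the core contrast it draws, a single global text embedding versus an MLLM acting as a planner over spatial (or spatiotemporal) latent cells, can be sketched in a few lines. The sketch below is an illustrative assumption, not the paper's code: the class names, dimensions, grid layout, and the single-vector summary of the MLLM's reasoning are all hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: contrasts global text conditioning with a
# hypothetical planner that emits one conditioning token per latent cell.

class GlobalConditioning(nn.Module):
    """Baseline: the text/MLLM encoder yields one pooled embedding per prompt."""
    def __init__(self, text_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Linear(text_dim, cond_dim)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim) -> one conditioning vector per prompt
        return self.proj(text_emb)


class LatentSpacePlanner(nn.Module):
    """Hypothetical planner: maps an MLLM hidden state to a grid of plan
    tokens, one per spatial latent cell, for localized conditioning."""
    def __init__(self, mllm_dim: int, cond_dim: int, grid_h: int, grid_w: int):
        super().__init__()
        self.grid_h, self.grid_w, self.cond_dim = grid_h, grid_w, cond_dim
        self.to_plan = nn.Linear(mllm_dim, grid_h * grid_w * cond_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, mllm_dim) summary of the MLLM's reasoning
        plan = self.to_plan(mllm_hidden)
        # (batch, grid_h * grid_w, cond_dim): one token per latent cell
        return plan.view(-1, self.grid_h * self.grid_w, self.cond_dim)


if __name__ == "__main__":
    batch, mllm_dim, cond_dim = 2, 4096, 768
    hidden = torch.randn(batch, mllm_dim)
    print(GlobalConditioning(mllm_dim, cond_dim)(hidden).shape)      # (2, 768)
    print(LatentSpacePlanner(mllm_dim, cond_dim, 8, 8)(hidden).shape)  # (2, 64, 768)
```

Under this reading, a diffusion backbone would cross-attend to the per-cell plan tokens rather than a single pooled vector, which is what would enable the region-level layout and attribute control the abstract describes.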