Exploring MLLM-Diffusion Information Transfer with MetaCanvas

December 12, 2025
Authors: Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang, Weifeng Chen, Zecheng He, Felix Juefei-Xu, Junzhe Sun, Zhipeng Fan, Ali Thabet, Mohit Bansal, Chu Wang
cs.AI

Abstract

Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We implement MetaCanvas on three different diffusion backbones and evaluate it empirically across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.
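To make the contrast concrete, the sketch below illustrates the general idea of global conditioning versus latent-space planning; it is not the paper's implementation. All module names, tensor shapes, and the cross-attention wiring (GlobalConditioning, LatentSpacePlanner, plan_tokens, the 8x8 grid) are assumptions chosen for illustration: a global baseline injects one pooled text embedding everywhere, while a planner head turns per-region MLLM outputs into spatially varying conditioning for the diffusion latents.

```python
# Hypothetical sketch (not MetaCanvas itself): global text conditioning
# vs. a latent-space planner that emits one plan token per grid cell,
# so different spatial regions receive different conditioning signals.
import torch
import torch.nn as nn


class GlobalConditioning(nn.Module):
    """Baseline: a single pooled text embedding conditions every latent position."""

    def __init__(self, text_dim: int = 768, latent_dim: int = 320):
        super().__init__()
        self.proj = nn.Linear(text_dim, latent_dim)

    def forward(self, latents: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # latents: (B, C, H, W), text_emb: (B, text_dim)
        cond = self.proj(text_emb)[:, :, None, None]  # (B, C, 1, 1)
        return latents + cond  # the same signal is broadcast everywhere


class LatentSpacePlanner(nn.Module):
    """Sketch of a planner head: per-grid-cell tokens from an MLLM are
    projected and cross-attended by the diffusion latents."""

    def __init__(self, mllm_dim: int = 4096, latent_dim: int = 320, grid: int = 8):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(mllm_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)

    def forward(self, latents: torch.Tensor, plan_tokens: torch.Tensor) -> torch.Tensor:
        # latents: (B, C, H, W); plan_tokens: (B, grid*grid, mllm_dim)
        assert plan_tokens.shape[1] == self.grid * self.grid
        b, c, h, w = latents.shape
        plan = self.proj(plan_tokens)                 # (B, G*G, C)
        q = latents.flatten(2).transpose(1, 2)        # (B, H*W, C)
        out, _ = self.attn(q, plan, plan)             # each position attends to the plan
        return latents + out.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    latents = torch.randn(2, 320, 32, 32)
    text_emb = torch.randn(2, 768)                    # e.g. a pooled text encoding
    plan_tokens = torch.randn(2, 64, 4096)            # e.g. MLLM hidden states per cell
    print(GlobalConditioning()(latents, text_emb).shape)
    print(LatentSpacePlanner()(latents, plan_tokens).shape)
```

The point of the contrast is that the baseline applies one conditioning vector uniformly, whereas a planner can route different instructions (layout, attributes, edits) to different regions of the latent grid, which is the intuition behind treating the MLLM as a latent-space planner rather than a global text encoder.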