
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

May 5, 2026
作者: Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan
cs.AI

Abstract

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.
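To make the coupling concrete, here is a minimal sketch of how a perception model (MLLM) might condition a diffusion transformer (MMDiT) through a shared multimodal interface, as the abstract describes. All module names, dimensions, and the projection design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SharedMultimodalInterface(nn.Module):
    """Hypothetical bridge: projects MLLM hidden states into the
    conditioning space consumed by the MMDiT."""
    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, mllm_dim) hidden states from the
        # spatially enhanced MLLM; the output conditions generation.
        return self.proj(mllm_hidden)

class UnifiedModelSketch(nn.Module):
    """Couples an understanding backbone (MLLM) with a generator (MMDiT),
    letting perception and generation interact through one interface."""
    def __init__(self, mllm: nn.Module, mmdit: nn.Module):
        super().__init__()
        self.mllm = mllm              # visual understanding / reasoning
        self.interface = SharedMultimodalInterface()
        self.mmdit = mmdit            # text-to-image synthesis / editing

    def generate(self, tokens: torch.Tensor, noisy_latents: torch.Tensor,
                 timestep: torch.Tensor) -> torch.Tensor:
        hidden = self.mllm(tokens)            # (B, T, mllm_dim)
        cond = self.interface(hidden)         # shared multimodal conditioning
        return self.mmdit(noisy_latents, timestep, cond)
```

In such a design, the same MLLM features that support geometry-aware reasoning also steer the diffusion transformer, which is one plausible way the "bidirectional loop" between understanding and controllable generation could be realized.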