Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
May 5, 2026
Authors: Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan
cs.AI
Abstract
We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.
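To make the coupling concrete, below is a minimal PyTorch sketch of how a spatially enhanced MLLM could feed a shared sequence of multimodal tokens into an MMDiT-style joint-attention block. The abstract does not specify the actual implementation, so every class name, dimension, and layer count here is an illustrative assumption rather than the JoyAI-Image architecture itself.

```python
# Minimal sketch of the MLLM -> MMDiT coupling described in the abstract.
# All module names, dimensions, and shapes are illustrative assumptions,
# not the released JoyAI-Image API.
import torch
import torch.nn as nn


class SpatialMLLMStub(nn.Module):
    """Stand-in for the spatially enhanced MLLM: maps instruction tokens
    to a shared sequence of multimodal condition tokens."""

    def __init__(self, vocab_size=32000, dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))  # (B, L, dim)


class MMDiTBlockStub(nn.Module):
    """Stand-in for one MMDiT block: joint self-attention over the
    concatenation of noisy image latents and condition tokens, which is
    one common reading of a 'shared multimodal interface'."""

    def __init__(self, dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img_tokens, cond_tokens):
        # Both streams attend to each other in one joint sequence.
        x = torch.cat([cond_tokens, img_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        n_cond = cond_tokens.shape[1]
        return x[:, n_cond:], x[:, :n_cond]  # updated image / condition streams


# One denoising step: the MLLM's multimodal tokens condition the diffusion
# transformer, so perception and generation meet in the shared sequence.
mllm, block = SpatialMLLMStub(), MMDiTBlockStub()
cond = mllm(torch.randint(0, 32000, (1, 16)))   # encoded instruction
latents = torch.randn(1, 64, 1024)              # noisy image latents
latents, cond = block(latents, cond)
print(latents.shape)  # torch.Size([1, 64, 1024])
```

Joint attention over the concatenated sequence (rather than one-way cross-attention) is what lets the bidirectional loop in the abstract operate in both directions: condition tokens can read the evolving image state while image latents read the instruction.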