Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Abstract

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.
