MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

Abstract

We present MMCORE, a unified framework for multimodal image generation and editing. MMCORE uses a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens; these embeddings then serve as conditioning signals for a diffusion model. This streamlined design transfers the understanding and reasoning capabilities of the VLM into the visual generation process. Because it requires neither deep fusion between autoregressive and diffusion models nor training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE integrates text-to-image synthesis with interleaved image generation and demonstrates robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad range of text-to-image and single/multi-image editing benchmarks.
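The query-token mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the single attention layer standing in for the frozen VLM, the class names `ToyFrozenVLM` and `QueryConnector`, and all dimensions are assumptions chosen only to show the data flow. Learnable queries are appended to the prompt tokens, passed through the frozen model, and the hidden states at the query positions are projected into a conditioning space that a diffusion model could consume.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VLM = 32    # hidden width of the stand-in VLM (assumed)
D_COND = 16   # width of the diffusion conditioning space (assumed)
N_QUERY = 4   # number of learnable query tokens (assumed)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyFrozenVLM:
    """Hypothetical stand-in for a pre-trained VLM: one self-attention layer with fixed weights."""

    def __init__(self, dim):
        scale = 1.0 / np.sqrt(dim)
        self.wq = rng.normal(size=(dim, dim)) * scale
        self.wk = rng.normal(size=(dim, dim)) * scale
        self.wv = rng.normal(size=(dim, dim)) * scale

    def __call__(self, tokens):
        q, k, v = tokens @ self.wq, tokens @ self.wk, tokens @ self.wv
        attn = softmax(q @ k.T / np.sqrt(tokens.shape[-1]))
        return tokens + attn @ v  # residual connection


class QueryConnector:
    """Hypothetical trainable parts: query tokens plus a projection to the conditioning space."""

    def __init__(self, n_query, d_vlm, d_cond):
        self.queries = rng.normal(size=(n_query, d_vlm)) * 0.02
        self.proj = rng.normal(size=(d_vlm, d_cond)) / np.sqrt(d_vlm)

    def __call__(self, vlm, prompt_tokens):
        # Append the queries to the prompt tokens so they can attend to the prompt.
        sequence = np.concatenate([prompt_tokens, self.queries], axis=0)
        hidden = vlm(sequence)
        # Keep only the hidden states at the query positions.
        query_states = hidden[-self.queries.shape[0]:]
        # Project into the space the diffusion model is conditioned on.
        return query_states @ self.proj


vlm = ToyFrozenVLM(D_VLM)
connector = QueryConnector(N_QUERY, D_VLM, D_COND)
prompt = rng.normal(size=(10, D_VLM))  # 10 prompt token embeddings (text and/or image)
cond = connector(vlm, prompt)
print(cond.shape)  # (4, 16)
```

In this sketch only `queries` and `proj` would be trained, which mirrors the abstract's point that neither deep fusion of the two models nor training from scratch is needed; the resulting `cond` array plays the role of the conditioning signal handed to the diffusion model.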
