MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

Abstract

We present MMCORE, a unified framework for multimodal image generation and editing. MMCORE uses a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens; these embeddings then serve as conditioning signals for a diffusion model. This streamlined design transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models, or for training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.
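The abstract does not spell out how the query tokens interact with the VLM, so the following is only a toy sketch of one plausible reading: learnable query tokens are appended to the prompt tokens, a frozen VLM processes the joint sequence, and the hidden states at the query positions are projected into conditioning embeddings for a diffusion model. Everything here is an assumption made for illustration, including the single self-attention layer standing in for the VLM, the names `toy_vlm_layer` and `predict_condition`, and all dimensions. It is not MMCORE's actual implementation.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols, scale=0.1):
    """Random matrix as a list of rows (stand-in for learned weights)."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """(n x k) @ (k x m) -> (n x m) for list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_vlm_layer(tokens, w_q, w_k, w_v):
    """Hypothetical stand-in for a frozen VLM: one self-attention layer."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(attn, v)

def predict_condition(prompt_tokens, query_tokens, vlm_weights, w_proj):
    """Map prompt tokens plus learnable queries to conditioning embeddings."""
    # Assumption: queries are appended so they can attend to the prompt.
    sequence = prompt_tokens + query_tokens
    hidden = toy_vlm_layer(sequence, *vlm_weights)
    # Keep only the hidden states at the query positions.
    query_hidden = hidden[len(prompt_tokens):]
    # Project into the (assumed) conditioning space of the diffusion model.
    return matmul(query_hidden, w_proj)

HIDDEN, COND, N_PROMPT, N_QUERY = 8, 4, 5, 3
prompt = rand_matrix(N_PROMPT, HIDDEN, 1.0)    # embedded text/image tokens
queries = rand_matrix(N_QUERY, HIDDEN, 1.0)    # learnable query tokens
vlm_weights = tuple(rand_matrix(HIDDEN, HIDDEN) for _ in range(3))
w_proj = rand_matrix(HIDDEN, COND)

condition = predict_condition(prompt, queries, vlm_weights, w_proj)
print(len(condition), len(condition[0]))  # 3 4
```

In this sketch only `queries` and `w_proj` would be trained while the VLM weights stay fixed, which is one way to read the abstract's claim of avoiding deep fusion and training from scratch. A diffusion model would then consume `condition` in place of, or alongside, its usual text-encoder output.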
