
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

April 2, 2025
作者: Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, Hang Xu
cs.AI

Abstract

We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to handle the three fundamental capabilities of understanding, generation, and editing simultaneously. Models like Chameleon and EMU3 use VQGAN for image discretization, but because they lack deep semantic interaction, they lag behind specialist models such as LLaVA on visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, yet they struggle with image editing due to poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, which limits its ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mllm.github.io/.
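
To make the dual-tokenization idea concrete, the sketch below is a minimal, illustrative PyTorch rendition of a two-branch tokenizer in the spirit of DualViTok. It is not the paper's implementation: the module names (`DualTokenizerSketch`, `VectorQuantizer`), dimensions, and codebook sizes are assumptions chosen for brevity. The point is only the structure described in the abstract: a semantic branch producing text-aligned tokens for understanding, a pixel branch preserving fine textures for generation and editing, discrete token ids for the MLLM's output side, and continuous quantized features for its input side (the continuous-input, discrete-output scheme).

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup (simplified; no commitment loss or EMA updates)."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, N, D) -> squared distances to every code: (B, N, K)
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        ids = dist.argmin(-1)          # discrete token ids, (B, N)
        z_q = self.codebook(ids)       # quantized continuous features, (B, N, D)
        return ids, z_q


class DualTokenizerSketch(nn.Module):
    """Hypothetical dual-branch visual tokenizer: one branch for text-aligned
    semantics (understanding), one for fine-grained texture (generation/editing).
    Both branches are quantized so the MLLM can predict discrete image tokens."""

    def __init__(self, patch_dim: int = 768, dim: int = 256,
                 sem_codes: int = 16384, pix_codes: int = 16384):
        super().__init__()
        # Stand-ins for a ViT-style semantic encoder and a VQGAN-style pixel encoder.
        self.semantic_branch = nn.Sequential(nn.Linear(patch_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.pixel_branch = nn.Sequential(nn.Linear(patch_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.sem_vq = VectorQuantizer(sem_codes, dim)
        self.pix_vq = VectorQuantizer(pix_codes, dim)

    def forward(self, patches: torch.Tensor):
        # patches: (B, N, patch_dim) flattened image patches.
        sem_ids, sem_feat = self.sem_vq(self.semantic_branch(patches))
        pix_ids, pix_feat = self.pix_vq(self.pixel_branch(patches))
        # Coarse-to-fine ordering: semantic tokens first, then texture tokens.
        token_ids = torch.cat([sem_ids, pix_ids], dim=1)       # discrete output targets
        continuous = torch.cat([sem_feat, pix_feat], dim=1)    # continuous input to the MLLM
        return token_ids, continuous


if __name__ == "__main__":
    tok = DualTokenizerSketch()
    dummy = torch.randn(2, 256, 768)   # 2 images, 256 patches each
    ids, feats = tok(dummy)
    print(ids.shape, feats.shape)      # torch.Size([2, 512]) torch.Size([2, 512, 256])
```

In the full system described above, a diffusion model then acts as the image detokenizer, mapping the predicted discrete tokens back to pixels and providing super-resolution; that stage is omitted from this sketch.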
