InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Abstract

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face an inherent trade-off between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks such as text rendering and scientific reasoning. The pipeline follows a reasoning-centric paradigm that uses Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior balance between performance and efficiency. Despite using only 4B parameters, it consistently outperforms unified baseline models more than 3x its size, such as BAGEL (14B), on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

