InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

March 10, 2026
作者: Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang
cs.AI

Abstract

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face an inherent trade-off between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance: despite using only 4B parameters, it consistently outperforms unified baselines more than 3x its size, such as BAGEL (14B), on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
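
To make the modular design concrete, the sketch below shows one plausible way an MLLM could condition a decoupled MMDiT-style generation head: the language model's hidden states are projected into a separate joint-attention head that denoises image latents. This is a minimal illustration only; every class name, dimension, and the simplified joint-attention block are our own assumptions, not InternVL-U's actual implementation.

```python
import torch
import torch.nn as nn


class MMDiTBlock(nn.Module):
    """Simplified MMDiT-style block: per-modality norms, one joint attention
    over the concatenated text-condition and image-latent streams."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # Both streams attend jointly over the concatenated sequence.
        x = torch.cat([self.norm_txt(txt), self.norm_img(img)], dim=1)
        attn_out, _ = self.attn(x, x, x)
        x = torch.cat([txt, img], dim=1) + attn_out
        x = x + self.mlp(x)
        n_txt = txt.shape[1]
        return x[:, :n_txt], x[:, n_txt:]


class GenerationHeadSketch(nn.Module):
    """Hypothetical bridge from MLLM hidden states (unified contextual
    modeling) to a decoupled MMDiT-style head that denoises image latents.
    All names and dimensions are illustrative assumptions."""

    def __init__(self, llm_dim=2048, gen_dim=1024, n_blocks=2, n_heads=8):
        super().__init__()
        # The projection hands the head only a conditioning view of the
        # MLLM features, keeping understanding and generation representations
        # decoupled, as the abstract describes.
        self.cond_proj = nn.Linear(llm_dim, gen_dim)
        self.blocks = nn.ModuleList(
            MMDiTBlock(gen_dim, n_heads) for _ in range(n_blocks)
        )
        self.out = nn.Linear(gen_dim, gen_dim)  # e.g. noise/velocity prediction


    def forward(self, llm_hidden, noisy_latents):
        txt = self.cond_proj(llm_hidden)
        img = noisy_latents
        for blk in self.blocks:
            txt, img = blk(txt, img)
        return self.out(img)


# Toy shapes: 16 conditioning tokens from the MLLM, 64 latent patches.
head = GenerationHeadSketch()
pred = head(torch.randn(1, 16, 2048), torch.randn(1, 64, 1024))
print(pred.shape)  # torch.Size([1, 64, 1024])
```

The point of such a split is the trade-off the abstract names: the MLLM's understanding pathway stays intact while a separate diffusion-style head acquires generation capability, with only a learned projection coupling the two.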