

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

April 13, 2026
作者: Songlin Yang, Xianghao Kong, Anyi Rao
cs.AI

Abstract

Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.
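The paper's probing framework is not detailed in this abstract, but its central quantity, the entropy of a model's predictive distribution at each generation step, is straightforward to illustrate. The sketch below (a generic illustration, not the authors' implementation; the function name and toy logits are invented for this example) computes per-step Shannon entropy from logits, the kind of signal one could track to compare the "high-entropy creative" text trajectory against the "low-entropy fidelity" image trajectory:

```python
import numpy as np

def stepwise_entropy(logits):
    """Per-step Shannon entropy (in nats) of a model's predictive
    distributions. `logits` has shape (num_steps, vocab_size)."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # H = -sum p*log(p), treating 0*log(0) as 0.
    return -np.sum(np.where(p > 0, p * np.log(p), 0.0), axis=-1)

# Toy illustration of the pattern split described in the abstract:
# a near-deterministic step (low entropy) vs. a near-uniform step
# (high entropy, the maximum log(4) for a 4-way distribution).
low = stepwise_entropy([[10.0, 0.0, 0.0, 0.0]])[0]
high = stepwise_entropy([[0.0, 0.0, 0.0, 0.0]])[0]
```

Plotting such entropies across decoding steps yields the "entropy trajectories" the abstract refers to; divergent trajectory shapes between modalities would be the signature of Modality-Asymmetric Encoding.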