Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both encoding and response (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.
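To make the contrast between high-entropy and low-entropy responses concrete, the following is a minimal sketch of how an entropy trajectory over generation steps could be computed. It is not the probing framework from the paper, whose estimator the abstract does not specify; it assumes per-step output logits are available and uses plain Shannon entropy of the softmax distribution. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trajectory(logits_per_step):
    """Entropy of the predictive distribution at each generation step."""
    return [shannon_entropy(softmax(step)) for step in logits_per_step]


# Hypothetical toy logits (illustration only, not measured from any model):
# near-uniform distributions stand in for a high-entropy response,
# sharply peaked distributions for a low-entropy response.
flat_steps = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, -0.1, 0.0]]
peaked_steps = [[10.0, 0.0, 0.0, 0.0], [0.0, 12.0, 0.0, 0.0]]

flat = entropy_trajectory(flat_steps)
peaked = entropy_trajectory(peaked_steps)

print("near-uniform trajectory:", flat)
print("peaked trajectory:", peaked)
```

In this toy setting the near-uniform steps stay close to the maximum entropy for four outcomes (log 4 nats), while the peaked steps stay near zero. Comparing such per-step curves across modalities is one simple way to picture what "different entropy trajectories" could mean.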
