看见的代价：在单一范式内实现可信的多模态推理

摘要

视觉-语言模型（VLM）的快速普及常被描绘为实现统一多模态知识发现的关键，但其基于一个未经充分审视的假设：当前VLM能够忠实地综合多模态数据。我们认为情况往往并非如此，这一差距反映了主流视觉编码器-投影器-大语言模型范式中的可信度问题。最先进的模型并非从视觉输入中提取有依据的知识，而常表现出“功能盲视”——即利用强大语言先验来规避严重的视觉表征瓶颈。本研究挑战了多模态评估的传统方法论，该方法依赖数据消融或创建新数据集，从而将数据集偏差与架构能力不足混为一谈。我们提出一种信息论视角的转向：模态翻译协议，旨在量化我们所谓的“视觉代价”。通过翻译语义载荷而非消融它们，我们构建了三个新颖指标——视觉代价税、视觉代价诅咒与视觉代价谬误，最终形成语义充分性准则。此外，我们提出一个假设：多模态规模化的偏离定律——随着底层语言引擎扩展至前所未有的推理能力，视觉知识瓶颈的惩罚可能增加而非减少。我们认为学界应超越以“多模态增益”作为主要评估目标。通过将语义充分性准则从被动诊断约束提升为主动架构蓝图，我们为引导下一代人工智能系统走向真正多模态推理奠定了基础。

English

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.