The Expense of Seeing: Achieving Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Abstract

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery, but this framing rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue that they often do not, and that this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., they exploit strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics, the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing, which culminate in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue that the community should move beyond treating "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.