The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Abstract

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery, but this framing rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue that they often do not, and that this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness: they exploit strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics, the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing, which culminate in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue that the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.