The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Abstract

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery, but this framing rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue that they often do not, and that this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness: they exploit strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure, the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics, the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing, which culminate in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty imposed by the visual knowledge bottleneck may increase rather than diminish. We argue that the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.