Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

Abstract

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as text tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs on seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images, from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples. It reveals that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure-text reasoning traces paired with image inputs. The method raises image-mode accuracy on GSM8K from 30.71% to 92.72% and transfers to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
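
The data-construction step of the self-distillation idea can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate_from_text`, `render_to_image`, and the optional `keep` filter are hypothetical callables introduced here, and the abstract does not say whether traces are filtered before training.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class DistillationExample:
    image: bytes   # rendered question: what the model sees as input during training
    target: str    # the model's own text-mode reasoning trace, used as the target


def build_self_distillation_set(
    questions: Iterable[str],
    generate_from_text: Callable[[str], str],   # hypothetical: model answering from text tokens
    render_to_image: Callable[[str], bytes],    # hypothetical: text -> image bytes
    keep: Optional[Callable[[str, str], bool]] = None,  # assumption: optional trace filter
) -> List[DistillationExample]:
    """Pair each rendered question with the trace the model produced in text mode."""
    examples: List[DistillationExample] = []
    for question in questions:
        # Step 1: obtain the reasoning trace from the text-only input.
        trace = generate_from_text(question)
        # Optional step (an assumption, not stated in the abstract): drop unwanted traces.
        if keep is not None and not keep(question, trace):
            continue
        # Step 2: render the same question as an image and pair it with the trace.
        examples.append(
            DistillationExample(image=render_to_image(question), target=trace)
        )
    return examples
```

The resulting pairs would then be passed to an ordinary supervised fine-tuning loop with the image as input and the trace as the target; that loop is omitted because it depends on the model and training framework.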
