Pixels, Patterns, but No Poetry: To See The World like Humans

Abstract

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing the reasoning capabilities of MLLMs, a fundamental question persists: can MLLMs truly perceive the world as humans do? This paper shifts the focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that are trivial for humans. Both in-context learning and training on the language backbone, which were effective for previous benchmarks, fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, which is a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.
