Intriguing Properties of Large Language and Vision Models

October 7, 2024
作者: Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi
cs.AI

Abstract

Recently, large language and vision models (LLVMs) have received significant attention and development effort due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate several aspects: permutation invariance, robustness, math reasoning, alignment preserving and importance, by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
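The permutation-invariance probe described in finding (1) can be illustrated with a minimal sketch: shuffle the order of the visual patch embeddings and check whether an order-insensitive aggregation still yields the same representation. The code below is a hypothetical illustration, not the paper's actual pipeline; mean pooling stands in for the model's global processing, and the patch-grid and embedding dimensions are assumptions (e.g., a 24x24 CLIP-style patch grid).

```python
import numpy as np

# Hypothetical sketch of the permutation-invariance probe: if the model
# aggregates visual patches globally, shuffling the patch order should
# leave the aggregated representation unchanged. Dimensions are assumed
# (24x24 = 576 patches, 1024-dim embeddings), not taken from the paper.
rng = np.random.default_rng(0)
patches = rng.normal(size=(576, 1024))  # stand-in for patch embeddings

perm = rng.permutation(len(patches))
shuffled = patches[perm]  # randomly permuted patch sequence

# Mean pooling (an order-insensitive aggregation) is exactly
# permutation-invariant, so the two pooled vectors coincide.
pooled_original = patches.mean(axis=0)
pooled_shuffled = shuffled.mean(axis=0)
print(np.allclose(pooled_original, pooled_shuffled))  # True
```

In the paper's actual experiments, the comparison is between benchmark scores of the full LLVM on original versus patch-permuted inputs; this sketch only conveys why a globally aggregating model can be insensitive to patch order.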
