Intriguing Properties of Large Language and Vision Models

October 7, 2024
Authors: Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi
cs.AI

Abstract

Recently, large language and vision models (LLVMs) have received significant attention and development effort due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate several aspects — permutation invariance, robustness, math reasoning, and alignment preservation and importance — by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
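As a rough illustration of the permutation-invariance probe described in finding (1), the sketch below shuffles the order of a visual patch sequence before it would be handed to the LLM. All names here (`permute_patches`, `patch_embeddings`) are hypothetical stand-ins for the projector's output, not the authors' code.

```python
import random

def permute_patches(patch_embeddings, seed=0):
    """Return the visual patch sequence in a randomly permuted order.

    `patch_embeddings` stands in for the projector's output (one vector
    per image patch). The probe compares model behavior on the original
    versus the shuffled sequence; a model that processes the image
    globally should be largely unaffected by the reordering.
    """
    rng = random.Random(seed)  # fixed seed so the permutation is reproducible
    indices = list(range(len(patch_embeddings)))
    rng.shuffle(indices)
    return [patch_embeddings[i] for i in indices]

# Example: four toy "patch embeddings"
patches = [[0.1], [0.2], [0.3], [0.4]]
shuffled = permute_patches(patches, seed=1)
```

The permutation changes only the order, never the content: the shuffled sequence contains exactly the same patch vectors, which is what lets the probe isolate order sensitivity from information loss.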
