

Law of Vision Representation in MLLMs

August 29, 2024
Authors: Shijia Yang, Bohan Zhai, Quanzeng You, Jianbo Yuan, Hongxia Yang, Chenfeng Xu
cs.AI

Abstract

We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated with model performance. By leveraging this relationship, we are able to identify and train only the optimal vision representation, without finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
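The core empirical claim is a linear relationship between the AC score and benchmark performance. A minimal sketch of how such a relationship could be checked, using an ordinary least-squares fit and a Pearson correlation (the (AC score, performance) pairs below are hypothetical placeholders, not the paper's data):

```python
# Hedged sketch, not the authors' code: fit a line to hypothetical
# (AC score, benchmark performance) pairs and measure linear correlation.
import numpy as np

# Hypothetical AC scores and benchmark scores for several
# vision representation settings (placeholder values).
ac_scores = np.array([0.42, 0.55, 0.61, 0.70, 0.78, 0.85])
performance = np.array([48.1, 53.6, 56.0, 60.2, 63.9, 67.5])

# Ordinary least-squares linear fit: performance ~ slope * AC + intercept.
slope, intercept = np.polyfit(ac_scores, performance, 1)

# Pearson correlation coefficient quantifies how linear the relationship is.
r = np.corrcoef(ac_scores, performance)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```

With a strong linear fit on held-out settings, the fitted line can rank candidate vision representations by predicted performance, which is what lets the authors avoid finetuning the language model for every candidate.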