Law of Vision Representation in MLLMs
August 29, 2024
Authors: Shijia Yang, Bohan Zhai, Quanzeng You, Jianbo Yuan, Hongxia Yang, Chenfeng Xu
cs.AI
Abstract
We present the "Law of Vision Representation" in multimodal large language
models (MLLMs). It reveals a strong correlation between the combination of
cross-modal alignment, correspondence in vision representation, and MLLM
performance. We quantify the two factors using the cross-modal Alignment and
Correspondence score (AC score). Through extensive experiments involving
thirteen different vision representation settings and evaluations across eight
benchmarks, we find that the AC score is linearly correlated with model
performance. By leveraging this relationship, we are able to identify and train
only the optimal vision representation, without finetuning the language model
for every candidate, resulting in a 99.7% reduction in computational cost.
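The selection procedure implied by the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the AC scores, benchmark numbers, and the evaluated subset below are all made-up placeholder values, and the linear fit simply assumes the reported AC-score/performance relationship.

```python
import numpy as np

# Hypothetical AC scores for 13 vision representation settings
# (illustrative values, not taken from the paper).
ac_scores = np.array([0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.52,
                      0.58, 0.73, 0.45, 0.63, 0.68, 0.50])

# Suppose only a small subset has been fully trained and benchmarked.
evaluated_idx = [0, 3, 5, 8]
benchmark_scores = np.array([55.1, 57.0, 62.3, 64.8])  # illustrative

# Fit the assumed linear relationship: performance ≈ a * AC + b.
a, b = np.polyfit(ac_scores[evaluated_idx], benchmark_scores, deg=1)

# Predict performance for every setting and pick the best candidate,
# avoiding a full language-model finetuning run per setting.
predicted = a * ac_scores + b
best = int(np.argmax(predicted))
print(f"best setting index: {best}, predicted score: {predicted[best]:.1f}")
```

The point of the sketch is the cost structure: computing AC scores and a linear fit is cheap, so only the predicted-best representation needs the expensive full training run.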