Law of Vision Representation in MLLMs
August 29, 2024
Authors: Shijia Yang, Bohan Zhai, Quanzeng You, Jianbo Yuan, Hongxia Yang, Chenfeng Xu
cs.AI
Abstract
We present the "Law of Vision Representation" in multimodal large language
models (MLLMs). It reveals a strong correlation between the combination of
cross-modal alignment, correspondence in vision representation, and MLLM
performance. We quantify the two factors using the cross-modal Alignment and
Correspondence score (AC score). Through extensive experiments involving
thirteen different vision representation settings and evaluations across eight
benchmarks, we find that the AC score is linearly correlated with model
performance. By leveraging this relationship, we are able to identify and train
the optimal vision representation only, which does not require finetuning the
language model every time, resulting in a 99.7% reduction in computational
cost.
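
To make the selection idea concrete, here is a minimal sketch of how a linear relationship between AC score and benchmark performance could be used to rank vision representations without training a model for each one. It rests on several assumptions that the abstract does not confirm: each representation is summarized by a single scalar AC score, the fit is ordinary least squares, and a few representations have already been trained and evaluated. The names `fit_linear`, `rank_candidates`, and `rep_a` through `rep_f` are hypothetical, and all numbers are placeholders rather than results from the paper.

```python
import numpy as np


def fit_linear(ac_scores, performance):
    """Least-squares fit of performance ~ slope * ac_score + intercept."""
    x = np.asarray(ac_scores, dtype=float)
    y = np.asarray(performance, dtype=float)
    # Design matrix with one column for the AC score and one for the bias term.
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(slope), float(intercept)


def rank_candidates(slope, intercept, candidate_scores):
    """Order candidate names by predicted performance, best first."""
    predicted = {
        name: slope * score + intercept
        for name, score in candidate_scores.items()
    }
    return sorted(predicted, key=predicted.get, reverse=True)


# Placeholder data (NOT from the paper): representations that were
# already trained and evaluated, given as (AC score, benchmark score).
sampled = {
    "rep_a": (0.2, 50.0),
    "rep_b": (0.4, 60.0),
    "rep_c": (0.6, 70.0),
}

# Placeholder AC scores for representations that have not been trained.
candidates = {"rep_d": 0.5, "rep_e": 0.8, "rep_f": 0.3}

slope, intercept = fit_linear(
    [ac for ac, _ in sampled.values()],
    [perf for _, perf in sampled.values()],
)
ranking = rank_candidates(slope, intercept, candidates)
print(ranking)
```

Under these assumptions, only the top-ranked candidate would then be trained in full, which is the kind of saving the abstract refers to.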