Visual Instruction Bottleneck Tuning
May 20, 2025
Authors: Changdae Oh, Jiatong Li, Shawn Im, Yixuan Li
cs.AI
Abstract
Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger, more advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative approach to enhancing the robustness of MLLMs under distribution shifts, from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification for Vittle by revealing its connection to an information-theoretic robustness metric of MLLMs. Empirical validation of three MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves MLLM robustness under shifts by pursuing the learning of a minimal sufficient representation.