Visual Instruction Bottleneck Tuning
May 20, 2025
Authors: Changdae Oh, Jiatong Li, Shawn Im, Yixuan Li
cs.AI
Abstract
Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger, more advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative approach to enhancing the robustness of MLLMs under distribution shifts, from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification for Vittle by revealing its connection to an information-theoretic robustness metric of MLLMs. Empirical validation of three MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves MLLM robustness under shifts by pursuing the learning of a minimal sufficient representation.