
Visual Instruction Bottleneck Tuning

May 20, 2025
Authors: Changdae Oh, Jiatong Li, Shawn Im, Yixuan Li
cs.AI

Abstract

Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods for improving MLLM generalization typically require either more instruction data or more advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative, representation-learning approach to enhancing the robustness of MLLMs under distribution shifts. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB objective for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification for Vittle by revealing its connection to an information-theoretic robustness metric for MLLMs. Empirical validation of three MLLMs on open-ended and closed-form question answering and object hallucination detection tasks, spanning 45 datasets including 30 shift scenarios, demonstrates that Vittle consistently improves MLLM robustness under shifts by pursuing the learning of a minimal sufficient representation.
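
The abstract invokes the IB principle and a variational lower bound without stating them. As a rough sketch only, using the generic deep variational IB formulation (the symbols theta, phi, beta, and r(z) are illustrative notation, not taken from the paper, and Vittle's exact MLLM-specific bound may differ), the IB objective for an internal representation Z of a multimodal query X with target response Y is

\max_{\theta}\; I(Z;Y) - \beta\, I(Z;X),

which, up to the additive constant H(Y), admits the tractable lower bound

\mathbb{E}_{p(x,y)}\,\mathbb{E}_{p_\theta(z\mid x)}\!\big[\log q_\phi(y\mid z)\big] \;-\; \beta\,\mathbb{E}_{p(x)}\, D_{\mathrm{KL}}\!\big(p_\theta(z\mid x)\,\big\|\, r(z)\big),

where q_\phi(y\mid z) is a variational decoder and r(z) is a fixed prior over representations. Maximizing the first term keeps Z sufficient for predicting Y, while the KL term compresses away input-specific detail; this sufficiency-minimality trade-off is what the abstract summarizes as learning a "minimal sufficient representation."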
