Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
May 20, 2025
Authors: Woody Haosheng Gan, Deqing Fu, Julian Asilis, Ollie Liu, Dani Yogatama, Vatsal Sharan, Robin Jia, Willie Neiswanger
cs.AI
Abstract
Steering methods have emerged as effective and targeted tools for guiding
large language models' (LLMs) behavior without modifying their parameters.
Multimodal large language models (MLLMs), however, do not currently enjoy the
same suite of techniques, due in part to their recency and architectural
diversity. Inspired by this gap, we investigate whether MLLMs can be steered
using vectors derived from their text-only LLM backbone, via sparse
autoencoders (SAEs), mean shift, and linear probing. We find that text-derived
steering consistently enhances multimodal accuracy across diverse MLLM
architectures and visual tasks. In particular, mean shift boosts spatial
relationship accuracy on CV-Bench by up to +7.3% and counting accuracy by up to
+3.3%, outperforming prompting and exhibiting strong generalization to
out-of-distribution datasets. These results highlight textual steering vectors
as a powerful, efficient mechanism for enhancing grounding in MLLMs with
minimal additional data collection and computational overhead.
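To make the mean-shift approach concrete, below is a minimal sketch of how such a textual steering vector can be computed and applied. This is not the authors' released code: the backbone name, layer index, steering strength, and contrast prompts are all illustrative assumptions. The core idea follows the abstract: take the difference of mean hidden activations between two sets of text-only prompts from the LLM backbone, then add the scaled vector to the model's residual stream at inference time.

```python
# Minimal sketch of mean-shift steering (illustrative, not the paper's code).
# Assumes a HuggingFace-style causal LM; model name, LAYER, ALPHA, and the
# contrast prompts below are placeholder choices, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative backbone choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 16   # illustrative layer index
ALPHA = 4.0  # illustrative steering strength

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Mean last-token hidden state after decoder layer LAYER."""
    feats = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            # hidden_states[i + 1] is the output of decoder layer i
            feats.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(feats).mean(dim=0)

# Text-only contrast prompts targeting the concept to steer
# (e.g. spatial relations); purely illustrative examples.
pos = ["The cup is to the left of the plate.",
       "The dog sits behind the chair."]
neg = ["The cup is on the table.",
       "The dog sits on the chair."]
steering_vec = mean_activation(pos) - mean_activation(neg)

# At inference, add the scaled vector to the residual stream via a hook
# on the same layer whose activations defined the vector.
def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steering_vec.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
# ... run (multimodal) generation as usual with the hook active ...
handle.remove()
```

In an MLLM, the same hook would be registered on the corresponding layer of the vision-language model's LLM backbone, which is what makes text-derived vectors transferable to visual tasks.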