Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models

Abstract

Steering methods have emerged as effective and targeted tools for guiding large language models' (LLMs) behavior without modifying their parameters. Multimodal large language models (MLLMs), however, do not currently enjoy the same suite of techniques, due in part to their recency and architectural diversity. Inspired by this gap, we investigate whether MLLMs can be steered using vectors derived from their text-only LLM backbone, via sparse autoencoders (SAEs), mean shift, and linear probing. We find that text-derived steering consistently enhances multimodal accuracy across diverse MLLM architectures and visual tasks. In particular, mean shift boosts spatial relationship accuracy on CV-Bench by up to +7.3% and counting accuracy by up to +3.3%, outperforming prompting and exhibiting strong generalization to out-of-distribution datasets. These results highlight textual steering vectors as a powerful, efficient mechanism for enhancing grounding in MLLMs with minimal additional data collection and computational overhead.
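The abstract names mean shift as one way to derive a steering vector but does not spell out the computation. A minimal sketch follows, assuming the common difference-of-means formulation: the vector is the mean activation over prompts that express a concept minus the mean over prompts that do not, and it is added to a hidden state with a scaling factor. The function names, the toy activations, and the `alpha` parameter are illustrative assumptions, not the paper's code.

```python
from statistics import fmean

def mean_shift_vector(pos_acts, neg_acts):
    """Per-dimension difference between the mean activations of two prompt sets.

    pos_acts / neg_acts: lists of equal-length activation vectors
    (hypothetical stand-ins for hidden states from a text-only LLM backbone).
    """
    pos_mean = [fmean(col) for col in zip(*pos_acts)]
    neg_mean = [fmean(col) for col in zip(*neg_acts)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def apply_steering(hidden, vector, alpha=1.0):
    """Shift one hidden-state vector along the steering direction.

    alpha is an assumed scaling hyperparameter controlling steering strength.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy 3-dimensional activations, invented purely for illustration.
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]   # prompts that express the concept
neg_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]   # prompts that do not

steer = mean_shift_vector(pos_acts, neg_acts)
steered = apply_steering([0.5, 0.5, 0.5], steer, alpha=0.5)
```

In a real model the shift would typically be applied to a chosen layer's hidden states at inference time, which is why no parameters need to change; the choice of layer and of `alpha` are further assumptions this sketch leaves open.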
