

PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

February 16, 2024
作者: Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, Boyu Wang
cs.AI

Abstract

This paper demonstrates that a progressively aligned language model can effectively bridge frozen vision encoders and large language models (LLMs). While the fundamental architecture and pre-training methods of vision encoders and LLMs have been extensively studied, the architecture and training strategy of vision-language adapters vary significantly across recent works. Our research undertakes a thorough exploration of the state-of-the-art perceiver resampler architecture and builds a strong baseline. However, we observe that the vision-language alignment with perceiver resampler exhibits slow convergence and limited scalability with a lack of direct supervision. To address this issue, we propose PaLM2-VAdapter, employing a progressively aligned language model as the vision-language adapter. Compared to the strong baseline with perceiver resampler, our method empirically shows faster convergence, higher performance, and stronger scalability. Extensive experiments across various Visual Question Answering (VQA) and captioning tasks on both images and videos demonstrate that our model exhibits state-of-the-art visual understanding and multi-modal reasoning capabilities. Notably, our method achieves these advancements with 30~70% fewer parameters than the state-of-the-art large vision-language models, marking a significant efficiency improvement.
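To make the core idea concrete, below is a minimal, hypothetical sketch of the adapter concept the abstract describes: a small, trainable language-model-style decoder sits between a frozen vision encoder and a frozen LLM, compressing patch features into a fixed set of soft tokens via learned queries. All module names, dimensions, and the exact decoder configuration here are illustrative assumptions, not the paper's actual PaLM2-VAdapter implementation.

```python
# Hypothetical sketch (not the paper's code): a tiny transformer decoder
# acting as a vision-language adapter. Learned queries cross-attend to
# frozen vision-encoder features and emit a fixed number of soft tokens
# sized for the frozen LLM's embedding space.
import torch
import torch.nn as nn

class TinyLMAdapter(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=512, depth=2, heads=4, num_queries=32):
        super().__init__()
        # Learned query tokens: one per output soft token.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        # Project vision features into the decoder's model dimension.
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=llm_dim, nhead=heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, vis_feats):
        # vis_feats: (batch, num_patches, vis_dim) from a FROZEN vision encoder.
        mem = self.vis_proj(vis_feats)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        # Queries cross-attend to vision features; the output is a fixed-length
        # sequence of soft tokens to prepend to the frozen LLM's input.
        return self.decoder(q, mem)

# Simulate frozen vision-encoder output with random features.
batch, patches = 2, 196
vis_feats = torch.randn(batch, patches, 256)
adapter = TinyLMAdapter()
soft_tokens = adapter(vis_feats)
print(tuple(soft_tokens.shape))  # fixed-length output regardless of patch count
```

Because only the adapter is trained while both the vision encoder and the LLM stay frozen, the trainable parameter count stays small, which is consistent with the efficiency claim in the abstract.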