PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
February 16, 2024
Authors: Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, Boyu Wang
cs.AI
Abstract
This paper demonstrates that a progressively aligned language model can
effectively bridge frozen vision encoders and large language models (LLMs).
While the fundamental architecture and pre-training methods of vision encoders
and LLMs have been extensively studied, the architecture and training strategy
of vision-language adapters vary significantly across recent works. Our
research undertakes a thorough exploration of the state-of-the-art perceiver
resampler architecture and builds a strong baseline. However, we observe that
vision-language alignment with the perceiver resampler exhibits slow
convergence and limited scalability, owing to its lack of direct supervision. To
address this issue, we propose PaLM2-VAdapter, employing a progressively
aligned language model as the vision-language adapter. Compared to the strong
baseline with perceiver resampler, our method empirically shows faster
convergence, higher performance, and stronger scalability. Extensive
experiments across various Visual Question Answering (VQA) and captioning tasks
on both images and videos demonstrate that our model exhibits state-of-the-art
visual understanding and multi-modal reasoning capabilities. Notably, our
method achieves these advancements with 30~70% fewer parameters than the
state-of-the-art large vision-language models, marking a significant efficiency
improvement.
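
To make the adapter idea concrete, the following is a minimal sketch in plain PyTorch of how a small language-model-style Transformer can bridge a frozen vision encoder and a frozen LLM. The module names, dimensions, and token counts are illustrative assumptions chosen for this sketch, not the paper's released code or exact architecture.

```python
import torch
import torch.nn as nn


class LMAdapter(nn.Module):
    """Small LM-style Transformer used as a vision-language adapter (sketch).

    Visual tokens from a frozen vision encoder are projected into the adapter,
    mixed by a few self-attention blocks, and then mapped into the frozen LLM's
    embedding space so they can be prepended to the text prompt as soft tokens.
    All dimensions below are illustrative assumptions.
    """

    def __init__(self, vision_dim=1024, adapter_dim=512, llm_dim=4096,
                 depth=2, heads=8):
        super().__init__()
        self.in_proj = nn.Linear(vision_dim, adapter_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=adapter_dim, nhead=heads,
            dim_feedforward=4 * adapter_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out_proj = nn.Linear(adapter_dim, llm_dim)

    def forward(self, vision_tokens):          # (B, N, vision_dim)
        x = self.in_proj(vision_tokens)        # project into adapter width
        x = self.blocks(x)                     # self-attention over visual tokens
        return self.out_proj(x)                # (B, N, llm_dim) soft prompts


# Illustrative usage: only the adapter is trainable; the vision encoder and LLM
# stay frozen and simply supply inputs to / consume outputs from the adapter.
adapter = LMAdapter()
dummy_vision_tokens = torch.randn(2, 256, 1024)   # e.g. ViT patch features
soft_prompts = adapter(dummy_vision_tokens)       # would be fed to the frozen LLM
print(soft_prompts.shape)                         # torch.Size([2, 256, 4096])
```

In this sketch only the adapter receives gradients; the frozen encoder supplies patch tokens and the frozen LLM would consume the adapter's outputs as soft prompt embeddings, mirroring the bridging role the abstract describes for the progressively aligned language-model adapter.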