MouSi: Poly-Visual-Expert Vision-Language Models
January 30, 2024
Authors: Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Current large vision-language models (VLMs) often encounter challenges such
as insufficient capabilities of a single visual component and excessively long
visual tokens. These issues can limit the model's effectiveness in accurately
interpreting complex visual information and overly long contextual
information. Addressing these challenges is crucial for enhancing the
performance and applicability of VLMs. This paper proposes an
ensemble-of-experts technique that synergizes the capabilities of individual
visual encoders, including those skilled in image-text matching, OCR, and
image segmentation.
This technique introduces a fusion network to unify the processing of outputs
from different visual experts, while bridging the gap between image encoders
and pre-trained LLMs. In addition, we explore different positional encoding
schemes to alleviate the waste of positional encoding caused by lengthy image
feature sequences, effectively addressing the issue of position overflow and
length limitations. For instance, in our implementation, this technique
significantly reduces the positional occupancy in models like SAM, from a
substantial 4096 to a more efficient and manageable 64 or even down to 1.
Experimental results demonstrate that VLMs with multiple experts consistently
outperform isolated visual encoders, with performance improving markedly as
more experts are integrated. We have
open-sourced the training code used in this report. All of these resources can
be found on our project website.
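To make the fusion-network idea concrete, below is a minimal PyTorch sketch of a poly-visual-expert fusion layer. The class name `PolyExpertFusion`, the per-expert MLP projectors, and the expert/token dimensions are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class PolyExpertFusion(nn.Module):
    """Project each expert's token features into the LLM embedding space,
    then concatenate them along the sequence dimension."""

    def __init__(self, expert_dims, llm_dim=4096):
        super().__init__()
        # One small MLP projector per visual expert (LLaVA-style two-layer MLP).
        self.projectors = nn.ModuleList(
            nn.Sequential(nn.Linear(d, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for d in expert_dims
        )

    def forward(self, expert_feats):
        # expert_feats: one [batch, num_tokens_i, dim_i] tensor per expert;
        # token counts and feature dims may differ across experts.
        projected = [proj(f) for proj, f in zip(self.projectors, expert_feats)]
        return torch.cat(projected, dim=1)  # [batch, total_tokens, llm_dim]

# Usage with three hypothetical experts: CLIP-ViT (576 tokens, 1024-d),
# SAM (4096 tokens, 256-d), and DINOv2 (256 tokens, 1536-d).
fusion = PolyExpertFusion(expert_dims=[1024, 256, 1536])
feats = [torch.randn(2, 576, 1024), torch.randn(2, 4096, 256), torch.randn(2, 256, 1536)]
visual_tokens = fusion(feats)  # [2, 4928, 4096], prepended to the LLM input
```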
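The positional-encoding compression can likewise be sketched as letting groups of visual tokens share a single position id, so that SAM's 64x64 = 4096 patch tokens occupy only 64 positions in the LLM context, or even just one. The helper `grouped_position_ids` below is a hypothetical illustration of this idea under that assumption, not the paper's exact scheme:

```python
import torch

def grouped_position_ids(num_tokens: int, num_positions: int, offset: int = 0):
    """Assign num_tokens visual tokens to only num_positions position ids
    by letting groups of ceil(num_tokens / num_positions) tokens share an id."""
    group = (num_tokens + num_positions - 1) // num_positions
    return torch.arange(num_tokens) // group + offset

# 4096 SAM patch tokens occupy only 64 positions...
print(grouped_position_ids(4096, 64).unique().numel())  # 64
# ...or collapse to a single shared position.
print(grouped_position_ids(4096, 1).unique().numel())   # 1
```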