

MouSi: Poly-Visual-Expert Vision-Language Models

January 30, 2024
Authors: Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Current large vision-language models (VLMs) often encounter challenges such as the insufficient capability of a single visual component and excessively long visual token sequences. These issues can limit a model's effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, image segmentation, and more. The technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encodings caused by lengthy image feature sequences, effectively addressing position overflow and length limitations. For instance, in our implementation, this technique significantly reduces the positional occupancy of models like SAM, from a substantial 4096 positions to a more efficient and manageable 64, or even down to 1. Experimental results demonstrate that VLMs with multiple experts consistently outperform isolated visual encoders, and that performance improves markedly as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
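The two mechanisms the abstract describes, pooling a long visual token sequence so it occupies far fewer LLM positions, and a fusion step that projects each expert's features into a shared dimension before handing them to the LLM, can be illustrated with a minimal NumPy sketch. This is not the paper's actual fusion network; the shapes (a SAM-like expert with 4096 tokens of dimension 256, a CLIP-like expert with 576 tokens of dimension 1024, a shared LLM dimension of 4096) and the mean-pooling/linear-projection choices are illustrative assumptions.

```python
import numpy as np

def pool_positions(features, target_len):
    """Compress a long visual token sequence (e.g. 4096 SAM patch tokens)
    down to target_len positions by average-pooling fixed-size groups,
    so each group consumes only one position id in the LLM context."""
    seq_len, dim = features.shape
    assert seq_len % target_len == 0, "sequence must split evenly"
    group = seq_len // target_len
    return features.reshape(target_len, group, dim).mean(axis=1)

def fuse_experts(expert_outputs, proj_weights):
    """Toy 'fusion network': project each expert's pooled tokens into a
    shared embedding dimension, then concatenate along the sequence axis."""
    projected = [feats @ w for feats, w in zip(expert_outputs, proj_weights)]
    return np.concatenate(projected, axis=0)

# Hypothetical expert outputs (random stand-ins for real encoder features).
rng = np.random.default_rng(0)
sam_feat = rng.standard_normal((4096, 256))    # SAM-like: 4096 tokens
clip_feat = rng.standard_normal((576, 1024))   # CLIP-like: 576 tokens

sam_pooled = pool_positions(sam_feat, 64)      # 4096 -> 64 positions
clip_pooled = pool_positions(clip_feat, 576)   # unchanged (group size 1)

w_sam = rng.standard_normal((256, 4096)) * 0.01   # per-expert projections
w_clip = rng.standard_normal((1024, 4096)) * 0.01
fused = fuse_experts([sam_pooled, clip_pooled], [w_sam, w_clip])
print(fused.shape)  # → (640, 4096)
```

Pooling all the way to `target_len=1` collapses an expert's output to a single token, matching the abstract's extreme case; the trade-off is spatial detail for context length.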