Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
May 24, 2024
Authors: Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro
cs.AI
Abstract
The rapid development of large language and vision models (LLVMs) has been
driven by advances in visual instruction tuning. Recently, open-source LLVMs
have curated high-quality visual instruction tuning datasets and utilized
additional vision encoders or multiple computer vision models in order to
narrow the performance gap with powerful closed-source LLVMs. These
advancements are attributed to multifaceted information required for diverse
capabilities, including fundamental image understanding, real-world knowledge
about common-sense and non-object concepts (e.g., charts, diagrams, symbols,
signs, and math problems), and step-by-step procedures for solving complex
questions. Drawing from the multifaceted information, we present a new
efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages
multifaceted rationale to enhance understanding and answering capabilities. To
embed lengthy rationales containing abundant information, we employ the Mamba
architecture, capable of processing sequential data with linear time
complexity. We introduce a new concept of traversal of rationale that
facilitates efficient embedding of rationale. Subsequently, the backbone
multimodal language model (MLM) is trained to generate answers with the aid of
rationale. Through these steps, Meteor achieves significant improvements in
vision language performances across multiple evaluation benchmarks requiring
diverse capabilities, without scaling up the model size or employing additional
vision encoders and computer vision models.
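To make the described pipeline concrete, the following is a minimal PyTorch sketch of the two-stage flow the abstract outlines: a linear-time sequential encoder condenses a lengthy rationale into a short set of embeddings, and the backbone multimodal language model (MLM) then answers conditioned on them. All class names, dimensions, and the GRU stand-in for the Mamba block are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of a Meteor-style pipeline, assuming the structure
# described in the abstract:
#   (1) a linear-time sequential module embeds a lengthy rationale,
#   (2) the backbone MLM generates answers with the aid of that embedded
#       ("traversed") rationale.
# Class names, dimensions, and module choices are hypothetical placeholders.

import torch
import torch.nn as nn


class RationaleEmbedder(nn.Module):
    """Stand-in for the Mamba rationale encoder.

    A real implementation would stack selective state-space (Mamba) blocks,
    which scan rationale tokens in time linear in sequence length; a single
    GRU layer is used here only so the sketch runs end to end.
    """

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.scan = nn.GRU(d_model, d_model, batch_first=True)  # linear-time sequential scan
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, rationale_tokens: torch.Tensor) -> torch.Tensor:
        # rationale_tokens: (batch, rationale_len, d_model)
        hidden, _ = self.scan(rationale_tokens)
        # Keep a handful of summary states so the backbone MLM sees a short
        # embedded rationale rather than the full lengthy sequence.
        summary = hidden[:, ::64, :]  # hypothetical stride; the paper's scheme differs
        return self.proj(summary)


class BackboneMLM(nn.Module):
    """Placeholder backbone multimodal language model."""

    def __init__(self, d_model: int = 512, vocab_size: int = 32000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, vision_tokens, question_tokens, rationale_embeds):
        # The MLM is conditioned on image features, the question, and the
        # embedded rationale before predicting answer tokens.
        x = torch.cat([vision_tokens, rationale_embeds, question_tokens], dim=1)
        return self.lm_head(self.blocks(x))


if __name__ == "__main__":
    b, d = 2, 512
    vision = torch.randn(b, 576, d)      # e.g., patch features from a vision encoder
    question = torch.randn(b, 32, d)
    rationale = torch.randn(b, 2048, d)  # lengthy rationale token embeddings

    rationale_embeds = RationaleEmbedder(d)(rationale)            # (b, 32, d)
    logits = BackboneMLM(d)(vision, question, rationale_embeds)   # (b, 640, vocab)
    print(logits.shape)
```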