Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
May 24, 2024
Authors: Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro
cs.AI
Abstract
The rapid development of large language and vision models (LLVMs) has been
driven by advances in visual instruction tuning. Recently, open-source LLVMs
have curated high-quality visual instruction tuning datasets and utilized
additional vision encoders or multiple computer vision models in order to
narrow the performance gap with powerful closed-source LLVMs. These
advancements are attributed to multifaceted information required for diverse
capabilities, including fundamental image understanding, real-world knowledge
about common-sense and non-object concepts (e.g., charts, diagrams, symbols,
signs, and math problems), and step-by-step procedures for solving complex
questions. Drawing from the multifaceted information, we present a new
efficient LLVM, Mamba-based traversal of rationale (Meteor), which leverages
multifaceted rationale to enhance understanding and answering capabilities. To
embed lengthy rationales containing abundant information, we employ the Mamba
architecture, capable of processing sequential data with linear time
complexity. We introduce a new concept of traversal of rationale that
facilitates efficient embedding of rationale. Subsequently, the backbone
multimodal language model (MLM) is trained to generate answers with the aid of
rationale. Through these steps, Meteor achieves significant improvements in
vision-language performance across multiple evaluation benchmarks requiring
diverse capabilities, without scaling up the model size or employing additional
vision encoders and computer vision models.
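The abstract describes a pipeline in which a Mamba module, with linear time complexity in sequence length, embeds a lengthy rationale, and the backbone multimodal language model then generates the answer with the aid of that embedded rationale. The toy PyTorch sketch below only illustrates that flow under stated assumptions: the module names, the simplified recurrent scan standing in for a real Mamba selective state-space block, the use of a small set of "traversal of rationale" (tor) query embeddings, and all dimensions are hypothetical and are not taken from the paper's implementation.

```python
# Illustrative sketch only; names, dimensions, and the simplified recurrence are assumptions.
import torch
import torch.nn as nn


class LinearTimeScan(nn.Module):
    """Stand-in for a Mamba-style block: one recurrent pass over the sequence,
    i.e. linear time in sequence length (not the real selective SSM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.state_decay = nn.Parameter(torch.full((dim,), 0.9))
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        u = self.in_proj(x)
        outputs = []
        for t in range(x.size(1)):              # single pass over the rationale
            h = self.state_decay * h + u[:, t]  # recurrent state update
            outputs.append(self.out_proj(h))
        return torch.stack(outputs, dim=1)


class RationaleEmbedder(nn.Module):
    """Embeds a lengthy rationale and compresses it into a few 'tor' embeddings
    (a hypothetical reading of the traversal-of-rationale idea)."""

    def __init__(self, vocab_size: int, dim: int, num_tor_tokens: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.scan = LinearTimeScan(dim)
        self.tor_queries = nn.Parameter(torch.randn(num_tor_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, rationale_ids: torch.Tensor) -> torch.Tensor:
        states = self.scan(self.embed(rationale_ids))               # (B, L, D)
        queries = self.tor_queries.unsqueeze(0).expand(rationale_ids.size(0), -1, -1)
        tor, _ = self.attn(queries, states, states)                 # (B, T, D)
        return tor


class ToyBackboneMLM(nn.Module):
    """Stand-in backbone multimodal language model: answers conditioned on
    image features, question tokens, and the compact rationale embeddings."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, image_feats, question_ids, tor_embeds):
        seq = torch.cat([image_feats, tor_embeds, self.embed(question_ids)], dim=1)
        return self.lm_head(self.encoder(seq))                      # token logits


if __name__ == "__main__":
    B, vocab, dim = 2, 1000, 64
    rationale_ids = torch.randint(0, vocab, (B, 512))   # long rationale sequence
    question_ids = torch.randint(0, vocab, (B, 32))
    image_feats = torch.randn(B, 16, dim)               # e.g. from a frozen vision encoder

    tor = RationaleEmbedder(vocab, dim)(rationale_ids)
    logits = ToyBackboneMLM(vocab, dim)(image_feats, question_ids, tor)
    print(logits.shape)                                  # (B, 16 + 8 + 32, vocab)
```

The property this sketch mirrors is that the rationale is consumed in a single linear-time pass and distilled into a handful of embeddings, so the backbone model conditions on a compact summary rather than attending over the full rationale at answer time.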