LEGO:Language Enhanced Multi-modal Grounding Model
January 11, 2024
作者: Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang
cs.AI
Abstract
Multi-modal large language models have demonstrated impressive performance
across various tasks in different modalities. However, existing multi-modal
models primarily emphasize capturing global information within each modality
while neglecting the importance of perceiving local information across
modalities. Consequently, these models lack the ability to effectively
understand the fine-grained details of input data, limiting their performance
in tasks that require a more nuanced understanding. To address this limitation,
there is a compelling need to develop models that enable fine-grained
understanding across multiple modalities, thereby enhancing their applicability
to a wide range of tasks. In this paper, we propose LEGO, a language enhanced
multi-modal grounding model. Beyond capturing global information like other
multi-modal models, our proposed model excels at tasks demanding a detailed
understanding of local information within the input. It demonstrates precise
identification and localization of specific regions in images or moments in
videos. To achieve this objective, we design a diversified dataset construction
pipeline, resulting in a multi-modal, multi-granularity dataset for model
training. The code, dataset, and demo of our model can be found at
https://github.com/lzw-lzw/LEGO.
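
As a rough, hypothetical illustration of what such fine-grained grounded output can look like (this is not LEGO's actual interface or tag syntax; the tag names, coordinate convention, and helper function below are assumptions for illustration only), the following Python sketch parses region boxes and temporal spans embedded in a model's text response:

import re

# Hypothetical markup for grounded responses (illustrative only, not LEGO's real format):
#   image regions:  <box>x1, y1, x2, y2</box>   with coordinates normalized to [0, 1]
#   video moments:  <time>start, end</time>     in seconds
BOX_PATTERN = re.compile(r"<box>([^<]+)</box>")
TIME_PATTERN = re.compile(r"<time>([^<]+)</time>")

def parse_grounding(text: str) -> dict:
    """Extract bounding boxes and time spans from a grounded model reply."""
    boxes = [tuple(float(v) for v in m.split(",")) for m in BOX_PATTERN.findall(text)]
    spans = [tuple(float(v) for v in m.split(",")) for m in TIME_PATTERN.findall(text)]
    return {"boxes": boxes, "spans": spans}

reply = ("The dog <box>0.12, 0.30, 0.55, 0.88</box> starts running "
         "at <time>3.5, 7.0</time> in the clip.")
print(parse_grounding(reply))
# {'boxes': [(0.12, 0.3, 0.55, 0.88)], 'spans': [(3.5, 7.0)]}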