LEGO:Language Enhanced Multi-modal Grounding Model
January 11, 2024
作者: Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang
cs.AI
Abstract
Multi-modal large language models have demonstrated impressive performance
across various tasks in different modalities. However, existing multi-modal
models primarily emphasize capturing global information within each modality
while neglecting the importance of perceiving local information across
modalities. Consequently, these models lack the ability to effectively
understand the fine-grained details of input data, limiting their performance
in tasks that require a more nuanced understanding. To address this limitation,
there is a compelling need to develop models that enable fine-grained
understanding across multiple modalities, thereby enhancing their applicability
to a wide range of tasks. In this paper, we propose LEGO, a language enhanced
multi-modal grounding model. Beyond capturing global information like other
multi-modal models, our proposed model excels at tasks demanding a detailed
understanding of local information within the input. It demonstrates precise
identification and localization of specific regions in images or moments in
videos. To achieve this objective, we design a diversified dataset construction
pipeline, resulting in a multi-modal, multi-granularity dataset for model
training. The code, dataset, and demo of our model can be found at
https://github.com/lzw-lzw/LEGO.
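
As a rough, hypothetical illustration of what such fine-grained grounded output can look like (this is not LEGO's actual interface or tag syntax; the tag names, coordinate convention, and helper function below are assumptions for illustration only), the following Python sketch parses region boxes and temporal spans embedded in a model's text response:

import re

# Hypothetical markup for grounded responses (illustrative only, not LEGO's real format):
#   image regions:  <box>x1, y1, x2, y2</box>   with coordinates normalized to [0, 1]
#   video moments:  <time>start, end</time>     in seconds
BOX_PATTERN = re.compile(r"<box>([^<]+)</box>")
TIME_PATTERN = re.compile(r"<time>([^<]+)</time>")

def parse_grounding(text: str) -> dict:
    """Extract bounding boxes and time spans from a grounded model reply."""
    boxes = [tuple(float(v) for v in m.split(",")) for m in BOX_PATTERN.findall(text)]
    spans = [tuple(float(v) for v in m.split(",")) for m in TIME_PATTERN.findall(text)]
    return {"boxes": boxes, "spans": spans}

reply = ("The dog <box>0.12, 0.30, 0.55, 0.88</box> starts running "
         "at <time>3.5, 7.0</time> in the clip.")
print(parse_grounding(reply))
# {'boxes': [(0.12, 0.3, 0.55, 0.88)], 'spans': [(3.5, 7.0)]}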