

LEGO:Language Enhanced Multi-modal Grounding Model

January 11, 2024
作者: Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang
cs.AI

Abstract

Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose LEGO, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks demanding a detailed understanding of local information within the input. It demonstrates precise identification and localization of specific regions in images or moments in videos. To achieve this objective, we design a diversified dataset construction pipeline, resulting in a multi-modal, multi-granularity dataset for model training. The code, dataset, and demo of our model can be found at https://github.com/lzw-lzw/LEGO.