

NExT-Chat: An LMM for Chat, Detection and Segmentation

November 8, 2023
Authors: Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, Tat-Seng Chua
cs.AI

Abstract

The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). To enhance the level of visual comprehension, recent studies have equipped LMMs with region-level understanding capabilities by representing object bounding box coordinates as a series of text sequences (pixel2seq). In this paper, we introduce a novel paradigm for object location modeling called the pixel2emb method, in which we ask the LMM to output location embeddings that are then decoded by different decoders. This paradigm allows different location formats (such as bounding boxes and masks) to be used in multimodal conversations. Furthermore, this kind of embedding-based location modeling enables the utilization of existing practices in localization tasks, such as detection and segmentation. In scenarios with limited resources, our pixel2emb demonstrates superior performance compared to existing state-of-the-art (SOTA) approaches in both location input and output tasks under fair comparison. Leveraging the proposed pixel2emb method, we train an LMM named NExT-Chat and demonstrate its capability to handle multiple tasks such as visual grounding, region captioning, and grounded reasoning.
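To make the pixel2emb idea concrete, here is a minimal PyTorch sketch of the decoding step: the LMM emits a hidden-state embedding at a special location token, and small task-specific heads map that embedding to a bounding box or a segmentation mask. All names here (BoxDecoder, MaskDecoder, the chosen dimensions) are hypothetical illustrations of the paradigm, not the authors' released implementation.

```python
# Hypothetical sketch of pixel2emb decoding (not the authors' code):
# an LMM hidden state at a location token is decoded by separate heads.
import torch
import torch.nn as nn

class BoxDecoder(nn.Module):
    """MLP head regressing a normalized box (cx, cy, w, h) in [0, 1]."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4), nn.Sigmoid(),
        )

    def forward(self, loc_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(loc_emb)  # (B, 4)

class MaskDecoder(nn.Module):
    """Query-style mask head: dot product between the projected location
    embedding and per-pixel vision features yields mask logits."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, loc_emb: torch.Tensor, pixel_feats: torch.Tensor) -> torch.Tensor:
        # loc_emb: (B, dim); pixel_feats: (B, dim, H, W)
        q = self.proj(loc_emb)
        return torch.einsum("bd,bdhw->bhw", q, pixel_feats)  # (B, H, W)

# Usage with stand-in tensors (dim kept small for the demo).
dim = 256
loc_emb = torch.randn(2, dim)               # LMM hidden state at the location token
pixel_feats = torch.randn(2, dim, 64, 64)   # vision-encoder feature map
boxes = BoxDecoder(dim)(loc_emb)            # (2, 4) normalized boxes
masks = MaskDecoder(dim)(loc_emb, pixel_feats)  # (2, 64, 64) mask logits
```

Because the location lives in a continuous embedding rather than a token sequence, such heads can be trained with standard detection and segmentation losses (e.g., L1/IoU losses for boxes, per-pixel losses for masks), which is the practice-reuse benefit the abstract refers to.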