NExT-Chat: An LMM for Chat, Detection and Segmentation

November 8, 2023
Authors: Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, Tat-Seng Chua
cs.AI

Abstract

The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). To enhance visual comprehension, recent studies have equipped LMMs with region-level understanding capabilities by representing object bounding box coordinates as text sequences (pixel2seq). In this paper, we introduce a novel paradigm for object location modeling called the pixel2emb method, in which we ask the LMM to output location embeddings that are then decoded by different decoders. This paradigm allows different location formats (such as bounding boxes and masks) to be used in multimodal conversations. Furthermore, this kind of embedding-based location modeling enables the utilization of existing practices in localization tasks, such as detection and segmentation. In scenarios with limited resources, our pixel2emb demonstrates superior performance compared to existing state-of-the-art (SOTA) approaches in both location input and output tasks under fair comparison. Leveraging the proposed pixel2emb method, we train an LMM named NExT-Chat and demonstrate its capability of handling multiple tasks such as visual grounding, region captioning, and grounded reasoning.
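
To make the pixel2emb idea concrete, below is a minimal PyTorch sketch of what an embedding-based location head could look like: a hidden state taken from a location token in the LMM's output is fed to separate decoders for boxes and masks. The module names (BoxDecoder, MaskDecoder), the hypothetical <loc> trigger token, the hidden size of 4096, and the decoder designs are all illustrative assumptions, not the paper's actual architecture.

```python
# Sketch of embedding-based location decoding (pixel2emb), under the
# assumptions stated above; the paper's real decoders may differ.
import torch
import torch.nn as nn


class BoxDecoder(nn.Module):
    """Maps a location embedding to a normalized box (cx, cy, w, h)."""

    def __init__(self, dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.ReLU(),
            nn.Linear(dim // 4, 4),
            nn.Sigmoid(),  # keep coordinates in [0, 1]
        )

    def forward(self, loc_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(loc_emb)


class MaskDecoder(nn.Module):
    """Maps a location embedding to a coarse grid of mask logits."""

    def __init__(self, dim: int = 4096, mask_size: int = 64):
        super().__init__()
        self.mask_size = mask_size
        self.proj = nn.Linear(dim, mask_size * mask_size)

    def forward(self, loc_emb: torch.Tensor) -> torch.Tensor:
        logits = self.proj(loc_emb)
        return logits.view(-1, 1, self.mask_size, self.mask_size)


# Usage: pretend `hidden` is the LMM hidden state at a <loc> token.
hidden = torch.randn(1, 4096)
box = BoxDecoder()(hidden)    # -> (1, 4) normalized box
mask = MaskDecoder()(hidden)  # -> (1, 1, 64, 64) mask logits
```

Because the location is an embedding rather than a text sequence, each decoder can plausibly be trained with the standard losses of its task (e.g., L1/GIoU regression for boxes, per-pixel cross-entropy for masks), which is one reading of the abstract's claim that pixel2emb can reuse existing detection and segmentation practice.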