Introducing Visual Perception Token into Multimodal Large Language Model

February 24, 2025
Authors: Runpeng Yu, Xinyin Ma, Xinchao Wang
cs.AI

Abstract

To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the autonomous capability to control their own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, introducing Visual Perception Tokens improves the performance of a 2B model by 23.6%, raising its score from 0.572 to 0.708, and the enhanced 2B model even outperforms a 7B model (score 0.624) by 13.4%. Please check out our repo: https://github.com/yu-rp/VisualPerceptionToken
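
To make the mechanism concrete, the sketch below shows how a decoding loop might act on such tokens: when the model emits a perception token instead of ordinary text, the loop runs an extra vision pass and appends the resulting features to the context before decoding continues. This is a minimal, runnable illustration under assumed names only; the token strings <region_select> and <vision_reencode>, the DummyMLLM stub, and the helpers encode_region / reencode are hypothetical stand-ins, not the authors' actual implementation from the repo.

```python
# Control-flow sketch of decoding with Visual Perception Tokens.
# All names here are illustrative assumptions, not the paper's real API.
from dataclasses import dataclass
from typing import List, Optional, Tuple

REGION_SELECT = "<region_select>"      # assumed special-token string
VISION_REENCODE = "<vision_reencode>"  # assumed special-token string

@dataclass
class Step:
    token: str
    hidden: List[float]                               # last-layer hidden state of this token
    bbox: Optional[Tuple[int, int, int, int]] = None  # region coords, for REGION_SELECT only

class DummyMLLM:
    """Stand-in model that replays a fixed script of decoding steps."""
    def __init__(self, script: List[Step]):
        self.script = list(script)

    def next_step(self, context) -> Step:
        # A real MLLM would run one autoregressive decoding step here.
        return self.script.pop(0)

    def encode_region(self, image, bbox) -> List[str]:
        # Re-encode only the selected crop (e.g., at higher resolution).
        return [f"<vis:{bbox}>"]

    def reencode(self, image, control) -> List[str]:
        # Extra vision-encoder pass conditioned on the control signal.
        return ["<vis:reencoded>"]

def generate(model: DummyMLLM, image, prompt: str, max_steps: int = 16) -> str:
    context: List[str] = [prompt]
    answer: List[str] = []
    for _ in range(max_steps):
        step = model.next_step(context)
        if step.token == REGION_SELECT:
            # Region Selection Token: re-perceive the region it names.
            context += model.encode_region(image, step.bbox)
        elif step.token == VISION_REENCODE:
            # Vision Re-Encoding Token: its hidden state steers a second pass.
            context += model.reencode(image, control=step.hidden)
        elif step.token == "<eos>":
            break
        else:
            answer.append(step.token)
        context.append(step.token)
    return " ".join(answer)

# The model "decides" to re-inspect a region before answering.
script = [
    Step(REGION_SELECT, hidden=[0.0], bbox=(10, 20, 80, 90)),
    Step("the", [0.0]), Step("mug", [0.0]), Step("<eos>", [0.0]),
]
print(generate(DummyMLLM(script), image=None, prompt="Where is the mug?"))
```

In a real system, next_step would be one forward pass of the MLLM, and the features produced by each extra perception action would be appended to the model's input sequence before decoding resumes; the DummyMLLM above merely replays a scripted trajectory to show the control flow.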
