Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
October 2, 2025
Authors: Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu
cs.AI
Abstract
Multimodal large language models (MLLMs) have advanced rapidly in recent
years. However, existing approaches for vision tasks often rely on indirect
representations, such as generating coordinates as text for detection, which
limits performance and prevents dense prediction tasks like segmentation. To
overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a
unified paradigm that enables MLLMs to directly generate both textual and
diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs),
derived from visual patch embeddings of query images and interleaved seamlessly
with the LLM's output textual tokens. A lightweight decoder then transforms the
LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior
methods, PaDT processes VRTs independently at each forward pass and dynamically
expands the embedding table, thus improving localization and differentiation
among similar objects. We further tailor a training strategy for PaDT by
randomly selecting VRTs for supervised fine-tuning and introducing a robust
per-token cross-entropy loss. Our empirical studies across four visual
perception and understanding tasks show that PaDT consistently achieves
state-of-the-art performance, even when compared with significantly larger MLLM
models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
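The abstract describes the architecture only at a high level: VRT embeddings are projected from the query image's patch embeddings, appended to the embedding table independently at each forward pass, and the hidden states at VRT positions are decoded into dense predictions by a lightweight decoder. The following is a minimal PyTorch sketch of that flow, written under stated assumptions; all class, method, and parameter names (PaDTSketch, build_dynamic_vocab, decode_visual, box_head, mask_head) are hypothetical and are not taken from the official repository.

```python
# Illustrative sketch of the PaDT idea as described in the abstract; NOT the
# official implementation. Names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class PaDTSketch(nn.Module):
    def __init__(self, hidden_dim=1024, text_vocab_size=32000, num_patches=576):
        super().__init__()
        self.text_vocab_size = text_vocab_size
        self.text_embeddings = nn.Embedding(text_vocab_size, hidden_dim)
        # Projects ViT patch embeddings into the LLM token-embedding space,
        # yielding one candidate VRT embedding per image patch.
        self.vrt_proj = nn.Linear(hidden_dim, hidden_dim)
        # Lightweight decoder heads turning VRT hidden states into a box and
        # per-patch mask logits (stand-ins for detection/segmentation outputs).
        self.box_head = nn.Linear(hidden_dim, 4)
        self.mask_head = nn.Linear(hidden_dim, num_patches)

    def build_dynamic_vocab(self, patch_embeds):
        """patch_embeds: (num_patches, hidden_dim) from the query image.
        Returns a (text_vocab_size + num_patches, hidden_dim) embedding table
        that is rebuilt from scratch for every forward pass."""
        vrt_embeds = self.vrt_proj(patch_embeds)
        return torch.cat([self.text_embeddings.weight, vrt_embeds], dim=0)

    def decode_visual(self, hidden_states, token_ids):
        """hidden_states: (seq_len, hidden_dim); token_ids: (seq_len,).
        Selects positions where the LLM emitted a VRT id (>= text_vocab_size)
        and maps them to boxes and mask logits."""
        is_vrt = token_ids >= self.text_vocab_size
        vrt_hidden = hidden_states[is_vrt]              # (num_vrt, hidden_dim)
        boxes = self.box_head(vrt_hidden).sigmoid()     # normalized box coords
        mask_logits = self.mask_head(vrt_hidden)        # per-patch mask logits
        return boxes, mask_logits
```

Rebuilding the table per image ties each VRT id to the current image's patches rather than to a fixed learned vocabulary, which is what the abstract credits for improved localization and differentiation among similar objects.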
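The training recipe is stated only briefly (randomly selecting VRTs for supervised fine-tuning plus a robust per-token cross-entropy loss). One plausible reading, given below purely as an assumption and not as the paper's exact formulation, is to sample a few of the target object's VRT ids as labels and to count any VRT belonging to that object as correct when computing the token-level loss. The function names robust_vrt_ce and sample_target_vrts are hypothetical.

```python
# Hedged sketch of the training-side ideas from the abstract; one possible
# interpretation only, not the authors' loss.
import torch
import torch.nn.functional as F

def sample_target_vrts(object_vrt_ids, k=3, generator=None):
    """Randomly select up to k of the object's VRT ids to serve as the
    ground-truth visual tokens for this fine-tuning step."""
    perm = torch.randperm(len(object_vrt_ids), generator=generator)[:k]
    return object_vrt_ids[perm]

def robust_vrt_ce(logits, valid_vrt_ids):
    """logits: (vocab_size,) over the expanded vocabulary at one VRT position.
    valid_vrt_ids: LongTensor of token ids that all refer to the target object.
    Accepts any of the valid ids as correct by penalizing the negative log of
    the total probability mass assigned to them."""
    log_probs = F.log_softmax(logits, dim=-1)
    # logsumexp over log-probabilities == log of the summed probabilities.
    return -torch.logsumexp(log_probs[valid_vrt_ids], dim=-1)
```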