
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

October 2, 2025
Authors: Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu
cs.AI

Abstract

Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with the LLM's output text tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
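The abstract describes VRTs that are derived from the query image's patch embeddings, appended to a dynamically expanded embedding table, interleaved with generated text tokens, and finally mapped by a lightweight decoder to detection/segmentation/grounding outputs. The following is a minimal, illustrative PyTorch sketch of that flow under our own assumptions; the class and function names (VRTEmbedder, expand_embedding_table, LightweightDecoder), head designs, and dimensions are invented here for exposition and are not the paper's actual API. Refer to the linked repository for the authors' implementation.

```python
# Illustrative sketch only, not the PaDT codebase. All names and head shapes
# below are assumptions for exposition; see https://github.com/Gorilla-Lab-SCUT/PaDT.
import torch
import torch.nn as nn


class VRTEmbedder(nn.Module):
    """Projects per-image visual patch embeddings into Visual Reference Token (VRT)
    embeddings living in the LLM's embedding space."""

    def __init__(self, patch_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(patch_dim, llm_dim)  # hypothetical projection layer

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (num_patches, patch_dim) from the vision encoder of the query image
        return self.proj(patch_embeds)             # (num_patches, llm_dim) VRT embeddings


def expand_embedding_table(text_table: torch.Tensor, vrt_embeds: torch.Tensor) -> torch.Tensor:
    """Dynamically extends the output embedding table with this image's VRTs for a single
    forward pass, so the LLM can emit VRT ids interleaved with ordinary text token ids."""
    return torch.cat([text_table, vrt_embeds], dim=0)  # (vocab_size + num_patches, llm_dim)


class LightweightDecoder(nn.Module):
    """Maps the hidden states of generated VRTs to dense predictions; the two heads here
    (boxes and coarse mask logits) are placeholders for the paper's decoder outputs."""

    def __init__(self, llm_dim: int, mask_dim: int = 256):
        super().__init__()
        self.box_head = nn.Linear(llm_dim, 4)        # e.g. (cx, cy, w, h) per referenced object
        self.mask_head = nn.Linear(llm_dim, mask_dim)  # coarse mask logits, upsampled downstream

    def forward(self, vrt_hidden: torch.Tensor):
        # vrt_hidden: (num_generated_vrts, llm_dim) hidden states at VRT positions
        return self.box_head(vrt_hidden), self.mask_head(vrt_hidden)
```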