PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Abstract

Multimodal large language models (MLLMs) have made remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency on perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. PerceptionDLM is built on PerceptionDLM-Base, a strong baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, and its architecture takes full advantage of the parallel decoding of diffusion language models (DLMs). Specifically, we introduce efficient prompting and structured attention masking so that the model can perceive multiple masked regions simultaneously and generate region descriptions in parallel at both the sequence level and the token level. This design significantly improves inference efficiency over existing approaches, which process regions sequentially. To systematically evaluate how well DLMs parallelize visual perception, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by extending DLC-Bench to include multiple region masks per image, enabling joint evaluation of caption quality and inference efficiency. Experiments show that PerceptionDLM maintains competitive region captioning performance while achieving substantial speedups on multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, this is the first work to achieve parallel region captioning and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
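
The abstract does not specify the layout of the structured attention mask, so the following is only a minimal sketch of one plausible scheme, not the released implementation. It assumes a shared prefix (image and prompt tokens) followed by one caption segment per region, where each caption segment attends bidirectionally to itself and to the prefix but not to other regions' captions. The function name `build_region_mask` and its parameters `prefix_len` and `region_lens` are hypothetical.

```python
from typing import List


def build_region_mask(prefix_len: int, region_lens: List[int]) -> List[List[bool]]:
    """Return a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout (hypothetical, for illustration only):
      [ shared prefix | caption for region 1 | caption for region 2 | ... ]
    """
    # Segment id 0 marks the shared prefix; id k >= 1 marks the k-th region caption.
    segment = [0] * prefix_len
    for k, length in enumerate(region_lens, start=1):
        segment.extend([k] * length)

    total = len(segment)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # Every position sees the shared prefix; a caption token additionally
            # sees all tokens of its own segment (bidirectional, no causal order).
            mask[i][j] = segment[j] == 0 or segment[i] == segment[j]
    return mask


if __name__ == "__main__":
    # Two prefix tokens, then captions of length 2 and 3 decoded in one sequence.
    for row in build_region_mask(2, [2, 3]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions the caption segments cannot see one another, so several region descriptions can be denoised in a single forward pass over one packed sequence instead of one pass per region.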