Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
October 21, 2025
作者: Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
cs.AI
Abstract
While Multimodal Large Language Models (MLLMs) excel at holistic
understanding, they struggle to capture the dense world of complex scenes,
which requires fine-grained analysis of intricate details and object
inter-relationships. Region-level MLLMs have been a promising step. However,
previous attempts are generally optimized to understand given regions in
isolation, neglecting crucial global contexts. To address this, we introduce
Grasp Any Region (GAR) for comprehensive region-level visual understanding.
Empowered by an effective RoI-aligned feature replay technique, GAR supports
(1) precise perception by leveraging necessary global contexts, and (2)
modeling interactions between multiple prompts. Together, these capabilities
naturally yield (3) advanced compositional reasoning to answer specific free-form
questions about any region, shifting the paradigm from passive description to
active dialogue. Moreover, we construct GAR-Bench, which not only provides a
more accurate evaluation of single-region comprehension, but also, more
importantly, measures interactions and complex reasoning across multiple
regions. Extensive experiments demonstrate that GAR-1B not only maintains
state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by 4.5 points
on DLC-Bench, but also excels at modeling relationships between multiple
prompts with advanced comprehension capabilities, even surpassing InternVL3-78B
on GAR-Bench-VQA. Notably, our zero-shot GAR-8B even outperforms the
in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating that its strong
capabilities can be readily transferred to videos.
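
The abstract does not detail how the RoI-aligned feature replay is implemented. As a rough, non-authoritative sketch of the general idea (pooling region features from a shared backbone feature map and feeding them to the LLM alongside global context), the snippet below uses torchvision's roi_align. All shapes, variable names, and the simple concatenation-based fusion are illustrative assumptions, not the paper's actual design.

```python
import torch
from torchvision.ops import roi_align

# Hypothetical feature map from a vision encoder: (batch, channels, H, W).
feature_map = torch.randn(1, 256, 64, 64)

# One region prompt as (batch_index, x1, y1, x2, y2) in feature-map coordinates.
boxes = torch.tensor([[0.0, 10.0, 12.0, 40.0, 44.0]])

# Pool the region into a fixed-size grid of region tokens.
region_feats = roi_align(
    feature_map, boxes, output_size=(7, 7), spatial_scale=1.0, aligned=True
)
print(region_feats.shape)  # torch.Size([1, 256, 7, 7])

# One simple way to keep global context available: flatten both the full
# feature map and the region features into token sequences and concatenate
# them before passing the visual tokens to the multimodal LLM.
global_tokens = feature_map.flatten(2).transpose(1, 2)   # (1, 4096, 256)
region_tokens = region_feats.flatten(2).transpose(1, 2)  # (1, 49, 256)
visual_tokens = torch.cat([global_tokens, region_tokens], dim=1)
```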