Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
October 21, 2025
作者: Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
cs.AI
Abstract
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle to capture the dense world of complex scenes, which requires fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However,
previous attempts are generally optimized to understand given regions in
isolation, neglecting crucial global context. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding.
Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception that leverages the necessary global context, and (2) modeling of interactions between multiple prompts. Together, these capabilities naturally enable (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to
active dialogue. Moreover, we construct GAR-Bench, which not only provides a
more accurate evaluation of single-region comprehension, but also, more
importantly, measures interactions and complex reasoning across multiple
regions. Extensive experiments demonstrate that GAR-1B not only maintains state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5 on DLC-Bench, but also excels at modeling relationships between multiple
prompts with advanced comprehension capabilities, even surpassing InternVL3-78B
on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms the in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating that its strong capabilities transfer readily to videos.
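The abstract credits GAR's precise-yet-contextual perception to an "RoI-aligned feature replay technique." Below is a minimal sketch of one plausible reading of that idea, assuming torchvision's roi_align over a shared vision-encoder feature map; the function name, shapes, and token layout are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of RoI-aligned feature replay (not the authors' code).
# Region features are re-extracted ("replayed") from the same feature map
# that provides global scene tokens, so the LLM sees both.
import torch
from torchvision.ops import roi_align

def replay_region_tokens(feature_map, boxes, out_size=7):
    """feature_map: (1, C, H, W) vision-encoder output.
    boxes: (N, 4) region prompts (x1, y1, x2, y2) in feature-map coordinates.
    Returns (N, out_size*out_size, C) token sequences, one per region."""
    # roi_align expects (K, 5) rois: (batch_index, x1, y1, x2, y2)
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
    crops = roi_align(feature_map, rois, output_size=out_size, aligned=True)  # (N, C, s, s)
    return crops.flatten(2).transpose(1, 2)  # (N, s*s, C)

# Usage: concatenate global-context tokens with the replayed region tokens
# before feeding them to the LLM.
feats = torch.randn(1, 1024, 24, 24)                # e.g., a ViT feature map
boxes = torch.tensor([[2.0, 3.0, 10.0, 12.0]])      # one region prompt
global_tokens = feats.flatten(2).transpose(1, 2)    # (1, 576, 1024)
region_tokens = replay_region_tokens(feats, boxes)  # (1, 49, 1024)
llm_visual_input = torch.cat([global_tokens, region_tokens], dim=1)
```

In this reading, because the region tokens are cropped from the same feature map that yields the global tokens, the model can relate each prompted region both to the full scene and to other prompted regions, matching the contextual and multi-prompt capabilities the abstract describes.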