

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

December 19, 2025
作者: Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo
cs.AI

Abstract

Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity, where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects; (2) Spatial, understanding complex relational descriptions; (3) Limited, handling occluded or tiny objects; and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, which raises critical safety concerns for deployment. We explore two strategies for improvement: (1) test-time scaling, which selects the optimal response based on its thinking trajectory and improves complex grounding by up to 2.9%, and (2) data-mixture training, which teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations of MLLMs and a roadmap toward human-level visual grounding.
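To make the rejection-aware evaluation concrete, here is a minimal sketch in Python of how such a benchmark might score predictions. The `Example` dataclass, the `score` function, and the 0.5 IoU threshold are illustrative assumptions rather than the paper's exact implementation: a prediction counts as correct if it localizes the target above the IoU threshold, or correctly abstains on an ungroundable query.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Example:
    dimension: str         # "discriminative" | "spatial" | "limited" | "rejection"
    gt_box: Optional[Box]  # None for ungroundable (rejection) queries

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def score(ex: Example, pred_box: Optional[Box], iou_thresh: float = 0.5) -> bool:
    """Correct if the model localizes the target above the IoU threshold,
    or abstains (returns no box) on an ungroundable query."""
    if ex.gt_box is None:        # rejection case: the model must abstain
        return pred_box is None
    if pred_box is None:         # wrongly refused a groundable query
        return False
    return iou(ex.gt_box, pred_box) >= iou_thresh
```

Under this scheme, a model that reflexively outputs a box for every query scores 0% on the rejection dimension, matching the failure mode the abstract reports.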
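The test-time scaling strategy can likewise be sketched as best-of-n selection over sampled reasoning traces. `generate` and `score_trajectory` below are hypothetical stand-ins; the abstract does not detail the paper's actual trajectory-selection criterion.

```python
from typing import Callable, List, Tuple

def best_of_n(
    generate: Callable[[], Tuple[str, str]],   # returns (thinking_trace, answer)
    score_trajectory: Callable[[str], float],  # rates a reasoning trace
    n: int = 8,
) -> str:
    """Sample n candidate responses and return the answer whose
    thinking trajectory scores highest."""
    candidates: List[Tuple[float, str]] = []
    for _ in range(n):
        trace, answer = generate()
        candidates.append((score_trajectory(trace), answer))
    return max(candidates, key=lambda c: c[0])[1]
```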