Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit modality bias, both visual and textual, which stems from their tendency to take unimodal shortcuts rather than perform rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO enables the model to dynamically align its reasoning trajectories using Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and outperforms existing baselines.
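
The abstract names GRPO as the optimizer behind CVO but does not spell out its mechanics. As a rough illustration only, the group-relative step that gives GRPO its name can be sketched as follows: several responses are sampled for the same input, each receives a scalar reward, and every reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` term, the use of population standard deviation, and the example reward values are all assumptions made for this sketch; the abstract does not describe the paper's actual reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per sampled response to the same input.
    Population standard deviation and `eps` are choices made for this
    sketch, not details taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses above the group mean get a positive advantage,
    # those below it a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled reasoning trajectories.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group's own mean, no separate value network is needed; a group whose responses all receive the same reward yields zero advantages and therefore no learning signal.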
