Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition
February 4, 2026
Authors: Jinlong Ma, Yu Zhang, Xuefeng Bai, Kehai Chen, Yuwei Wang, Zeming Liu, Jun Yu, Min Zhang
cs.AI
Abstract
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract textual entities, assign them semantic categories, and ground them to their corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER end to end, moving beyond their typical role as auxiliary tools within cascaded pipelines. Our investigation reveals a fundamental challenge: MLLMs exhibit modality bias, both visual and textual, which stems from their tendency to take unimodal shortcuts instead of performing rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO uses Group Relative Policy Optimization (GRPO) to let the model dynamically align its reasoning trajectories. Experiments on GMNER and visual grounding tasks show that MCR effectively mitigates modality bias and outperforms existing baselines.
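For readers unfamiliar with GRPO, the core idea it is generally known for is that each sampled response is scored relative to the other responses drawn for the same input, so no separate value model is needed. The sketch below illustrates only that group-relative normalization step. The function name, the reward values, and the use of the population standard deviation are assumptions made for illustration; the abstract does not specify the reward design used in CVO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per response sampled for the same input.
    `eps` guards against division by zero when all rewards are equal.
    (Hypothetical helper, not taken from the paper.)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one image-text pair.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.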