Multimodal Fact-Level Attribution for Verifiable Reasoning
February 12, 2026
Authors: David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
cs.AI
Abstract
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
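To make the attribution requirement concrete, below is a minimal Python sketch of what a fact-level claim with modality- and time-grounded citations might look like, together with a trivial well-formedness check. The field names (`modality`, `start_sec`, `end_sec`) and the allowed modality set are illustrative assumptions, not the benchmark's actual schema; real evaluation would additionally verify that each cited segment entails the claim it supports.

```python
# Hypothetical sketch of a fact-level attribution record, assuming each citation
# must name both a modality and a temporal segment of the source input.
# Field names and structure are illustrative, not MuRGAt's actual format.
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    modality: str      # e.g. "video", "audio", "subtitle"
    start_sec: float   # start of the cited temporal segment
    end_sec: float     # end of the cited temporal segment


@dataclass
class FactClaim:
    text: str                  # a single factual claim in the reasoning chain
    citations: List[Citation]  # evidence supporting this claim


def has_valid_citations(claim: FactClaim, source_duration_sec: float) -> bool:
    """Sanity check: every citation names a known modality and a well-formed
    segment inside the source duration. This checks only structural validity,
    not whether the cited segment actually supports the claim."""
    allowed = {"video", "audio", "subtitle"}
    return bool(claim.citations) and all(
        c.modality in allowed
        and 0.0 <= c.start_sec < c.end_sec <= source_duration_sec
        for c in claim.citations
    )


if __name__ == "__main__":
    claim = FactClaim(
        text="The speaker announces the result before the crowd reacts.",
        citations=[Citation("audio", 12.0, 18.5), Citation("video", 18.5, 22.0)],
    )
    print(has_valid_citations(claim, source_duration_sec=120.0))  # True
```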