Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability
August 6, 2025
Authors: Haiqi Yang, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
cs.AI
Abstract
Large Multimodal Models (LMMs) have witnessed remarkable growth, showcasing
formidable capabilities in handling intricate multimodal tasks with exceptional
performance. Recent research has underscored the inclination of large language
models to passively accept defective inputs, often resulting in futile
reasoning on invalid prompts. However, the critical question of whether
LMMs can actively detect and scrutinize erroneous inputs remains largely
unexplored. To address this gap, we introduce the Input Scrutiny Ability
Evaluation Framework (ISEval), which encompasses seven categories of flawed
premises and three evaluation metrics. Our extensive evaluation of ten advanced
LMMs yields several key findings. Most models struggle to actively detect
flawed textual premises without guidance, reflecting a strong reliance on
explicit prompts for premise error identification. Error type affects
performance: models excel at identifying logical fallacies but struggle with
surface-level linguistic errors and certain conditional flaws. Modality trust
varies: Gemini 2.5 Pro and Claude Sonnet 4 balance visual and textual
information, whereas aya-vision-8b over-relies on text when the modalities
conflict. These findings underscore the urgent need to enhance LMMs'
proactive verification of input validity and offer new insights into
mitigating this problem. The code is available at
https://github.com/MLGroupJLU/LMM_ISEval.