
Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

August 6, 2025
Authors: Haiqi Yang, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
cs.AI

Abstract

Large Multimodal Models (LMMs) have witnessed remarkable growth, demonstrating strong performance on intricate multimodal tasks. However, recent research has underscored the inclination of large language models to passively accept defective inputs, often resulting in futile reasoning on invalid prompts. The critical question of whether LMMs can likewise actively detect and scrutinize erroneous inputs remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs yields several key findings. Most models struggle to actively detect flawed textual premises without guidance, reflecting a strong reliance on explicit prompts for premise-error identification. Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust also varies: Gemini 2.5 pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text when the two conflict. These findings underscore the urgent need to enhance LMMs' proactive verification of input validity and offer novel insights into mitigating the problem. The code is available at https://github.com/MLGroupJLU/LMM_ISEval.
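
The abstract contrasts two conditions: whether a model flags a flawed premise on its own versus only when explicitly told to check it. The following minimal Python sketch illustrates that kind of probe. It is not the authors' harness (their actual code is in the linked repository); the FlawedSample structure, the GUIDED_SUFFIX prompt, the keyword-based detects_flaw heuristic, and the query_lmm callable are all hypothetical stand-ins.

"""Illustrative input-scrutiny probe in the spirit of ISEval (all names assumed)."""

from dataclasses import dataclass
from typing import Callable

@dataclass
class FlawedSample:
    """One test case: a question whose textual premise is deliberately wrong."""
    image_url: str          # visual context shown to the model
    question: str           # question containing the flawed premise
    flaw_description: str   # what is wrong (bookkeeping only)

# Hypothetical guidance appended in the "guided" condition: the model is
# explicitly asked to scrutinize the premise before answering.
GUIDED_SUFFIX = (
    "\nBefore answering, check whether the question's premise is consistent "
    "with the image and with itself. If it is flawed, say so."
)

def detects_flaw(response: str) -> bool:
    """Crude keyword heuristic for whether the model flagged the input.
    (An assumption; a real evaluation would use a more robust judge.)"""
    markers = ("premise", "incorrect", "flaw", "contradict", "does not match")
    return any(m in response.lower() for m in markers)

def run_probe(samples: list[FlawedSample],
              query_lmm: Callable[[str, str], str]) -> dict[str, float]:
    """Return flaw-detection rates with and without explicit guidance.

    query_lmm(image_url, prompt) -> str is a stand-in for whatever
    multimodal API the model under test exposes.
    """
    unguided = guided = 0
    for s in samples:
        if detects_flaw(query_lmm(s.image_url, s.question)):
            unguided += 1
        if detects_flaw(query_lmm(s.image_url, s.question + GUIDED_SUFFIX)):
            guided += 1
    n = len(samples)
    return {"unguided_rate": unguided / n, "guided_rate": guided / n}

In this sketch, a large gap between unguided_rate and guided_rate mirrors the paper's central observation: a model that flags flaws only in the guided condition depends on explicit prompting rather than scrutinizing its inputs proactively.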