Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

Abstract

Large Multimodal Models (LMMs) have advanced rapidly and now handle complex multimodal tasks with strong performance. Recent research has shown that large language models tend to passively accept defective inputs, which often leads to futile reasoning on invalid prompts. However, whether LMMs can actively detect and scrutinize erroneous inputs remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs yields several key findings. First, most models struggle to actively detect flawed textual premises without guidance, reflecting a strong reliance on explicit prompts for premise error identification. Second, error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Third, modality trust varies across models: Gemini 2.5 Pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text when the two conflict. These findings underscore the urgent need to strengthen LMMs' proactive verification of input validity and offer new insights into mitigating the problem. The code is available at https://github.com/MLGroupJLU/LMM_ISEval.

