Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

Abstract

Large Multimodal Models (LMMs) have grown rapidly and now handle intricate multimodal tasks with strong performance. Recent research has shown that large language models tend to accept defective inputs passively, which often leads to futile reasoning over invalid prompts. The corresponding question for LMMs, whether they can actively detect and scrutinize erroneous inputs, remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which covers seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs yields the following key findings. Most models struggle to detect flawed textual premises without guidance, which reflects a strong reliance on explicit prompts for premise error identification. Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust varies: Gemini 2.5 pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text when the two modalities conflict. These findings underscore the urgent need to strengthen LMMs' proactive verification of input validity and offer new insights into mitigating the problem. The code is available at https://github.com/MLGroupJLU/LMM_ISEval.
