FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Abstract

Multimodal large language models (MLLMs) struggle with hallucinations, particularly on fine-grained queries, a challenge that existing benchmarks underrepresent because they focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, which applies Direct Preference Optimization (DPO) to FINER-inspired data. Fine-tuning four frontier MLLMs with FINER-Tuning yields gains of up to 24.2% (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
