FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Abstract

Multimodal large language models (MLLMs) struggle with hallucinations, particularly on fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, which applies Direct Preference Optimization (DPO) to FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields gains of up to 24.2% (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
