FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Abstract

Multimodal large language models (MLLMs) struggle with hallucinations, particularly under fine-grained queries, a challenge underrepresented in existing benchmarks, which focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, which applies Direct Preference Optimization (DPO) to FINER-inspired data. Fine-tuning four frontier MLLMs with FINER-Tuning yields gains of up to 24.2% (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites and enhancing general multimodal capabilities across six benchmarks. Code, benchmarks, and models are available at https://explainableml.github.io/finer-project/.
