Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Abstract

Recent advances in multimodal LLMs (MLLMs) have demonstrated a remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior work has explored alleviating hallucinations for static images, jointly mitigating visual object hallucinations and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that promotes object and action faithfulness by excluding spurious correlations and reinforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations latent in the MLLM and to transform the original captions into contrastive negatives. Furthermore, we develop a tracklet-phrase contrastive alignment that matches regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.
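The abstract does not specify the loss used for the tracklet-phrase alignment. As a rough illustration only, the sketch below assumes a generic InfoNCE-style objective: a tracklet embedding is pulled toward the embedding of its faithful phrase and pushed away from embeddings of hallucinated phrases. The function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical and are not taken from SANTA.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(tracklet, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's definition).

    tracklet:  embedding of an object or action tracklet
    positive:  embedding of the faithful phrase
    negatives: embeddings of hallucinated phrases
    """
    # Index 0 holds the positive pair; the rest are hallucinated negatives.
    logits = [cosine(tracklet, positive) / temperature]
    logits += [cosine(tracklet, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive phrase.
    return log_denominator - logits[0]


# Toy embeddings: the tracklet matches the first phrase and not the second.
tracklet = [1.0, 0.0]
faithful_phrase = [1.0, 0.0]
hallucinated_phrase = [0.0, 1.0]

loss_aligned = contrastive_loss(tracklet, faithful_phrase, [hallucinated_phrase])
loss_swapped = contrastive_loss(tracklet, hallucinated_phrase, [faithful_phrase])
```

With these toy inputs the loss is close to zero when the faithful phrase is treated as the positive and much larger when the hallucinated phrase takes its place, which is the behavior a contrastive objective of this kind is meant to produce.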
