Woodpecker: Hallucination Correction for Multimodal Large Language Models

Abstract

Hallucination casts a long shadow over rapidly evolving Multimodal Large Language Models (MLLMs): the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly rely on instruction tuning, which requires retraining the models with specific data. In this paper, we take a different approach and introduce a training-free method named Woodpecker. Just as a woodpecker heals trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented as a post-hoc remedy, Woodpecker can easily serve different MLLMs, and it remains interpretable because the intermediate outputs of the five stages can be inspected. We evaluate Woodpecker both quantitatively and qualitatively and show the considerable potential of this new paradigm. On the POPE benchmark, our method improves accuracy by 30.66% over the MiniGPT-4 baseline and by 24.33% over the mPLUG-Owl baseline. The source code is released at https://github.com/BradyFU/Woodpecker.
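The five stages named in the abstract can be pictured as a simple chain in which each stage consumes the previous stage's output and every intermediate result is kept for inspection. The sketch below is only an illustration of that control flow, not the released Woodpecker implementation: the names `Trace`, `Stages`, `run_pipeline`, and `toy_stages` are hypothetical, and the toy stages use string matching on a set of object names in place of the real language and vision models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Trace:
    """Intermediate outputs of the five stages, kept so they can be inspected."""
    concepts: List[str] = field(default_factory=list)
    questions: List[str] = field(default_factory=list)
    answers: Dict[str, str] = field(default_factory=dict)
    claims: List[str] = field(default_factory=list)
    corrected: str = ""


@dataclass
class Stages:
    """One pluggable callable per stage (hypothetical interface)."""
    extract_concepts: Callable[[str], List[str]]
    formulate_questions: Callable[[List[str]], List[str]]
    validate: Callable[[Set[str], List[str]], Dict[str, str]]
    generate_claims: Callable[[Dict[str, str]], List[str]]
    correct: Callable[[str, List[str]], str]


def run_pipeline(image: Set[str], text: str, stages: Stages) -> Trace:
    """Run the five stages in order on text already generated by some MLLM."""
    trace = Trace()
    trace.concepts = stages.extract_concepts(text)
    trace.questions = stages.formulate_questions(trace.concepts)
    trace.answers = stages.validate(image, trace.questions)
    trace.claims = stages.generate_claims(trace.answers)
    trace.corrected = stages.correct(text, trace.claims)
    return trace


def toy_stages() -> Stages:
    """Toy stand-ins (assumption): the 'image' is just a set of object names."""
    vocab = ("dog", "cat")
    prefix = "Is there a "

    def correct(text: str, claims: List[str]) -> str:
        # Drop every sentence that mentions an object the claims say is absent.
        absent = [c.split()[3].rstrip("?") for c in claims if c.endswith(" no")]
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        kept = [s for s in sentences if not any(a in s.lower() for a in absent)]
        return ". ".join(kept) + "." if kept else ""

    return Stages(
        extract_concepts=lambda text: [w for w in vocab if w in text.lower()],
        formulate_questions=lambda concepts: [f"{prefix}{c}?" for c in concepts],
        validate=lambda image, qs: {
            q: "yes" if q[len(prefix):-1] in image else "no" for q in qs
        },
        generate_claims=lambda answers: [f"{q} {a}" for q, a in answers.items()],
        correct=correct,
    )


if __name__ == "__main__":
    result = run_pipeline(
        {"dog"}, "A dog sits on the grass. A cat is next to it.", toy_stages()
    )
    print(result.claims)
    print(result.corrected)
```

Because the pipeline only post-processes text that was already generated, the upstream MLLM never appears in the code; that is one way to read the abstract's point that the method needs no retraining and can serve different models.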
