Woodpecker: Hallucination Correction for Multimodal Large Language Models
October 24, 2023
Authors: Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen
cs.AI
Abstract
Hallucination is a big shadow hanging over the rapidly evolving Multimodal
Large Language Models (MLLMs), referring to the phenomenon that the generated
text is inconsistent with the image content. To mitigate hallucinations,
existing studies mainly resort to instruction tuning, which requires retraining
the models with specific data. In this paper, we take a different path,
introducing a training-free method named Woodpecker. Just as a woodpecker heals
trees, it picks out and corrects hallucinations in the generated text.
Concretely, Woodpecker consists of five stages: key concept extraction,
question formulation, visual knowledge validation, visual claim generation, and
hallucination correction. Implemented as a post-hoc remedy, Woodpecker can
easily serve different MLLMs while remaining interpretable through the
intermediate outputs of the five stages. We evaluate Woodpecker both
quantitatively and qualitatively, demonstrating the great potential of this new
paradigm. On the POPE benchmark, our method obtains 30.66%/24.33% improvements
in accuracy over the baselines MiniGPT-4/mPLUG-Owl. The source code is released
at https://github.com/BradyFU/Woodpecker.
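
The five-stage pipeline described in the abstract amounts to a post-hoc loop around a text-only LLM and a visual grounding tool. The sketch below is a minimal illustration of that flow under stated assumptions, not the released implementation: the class, method, and prompt names (WoodpeckerSketch, extract_key_concepts, the TextFn/VisionFn callables, and so on) are hypothetical placeholders; the actual code lives at https://github.com/BradyFU/Woodpecker.

```python
# Illustrative sketch of a five-stage post-hoc hallucination-correction loop
# in the spirit of Woodpecker. All names and prompts here are hypothetical.
from dataclasses import dataclass
from typing import Callable, List

# Two kinds of helpers the pipeline relies on:
# a text-only LLM function and a vision tool that inspects the image.
TextFn = Callable[[str], str]
VisionFn = Callable[[str, str], str]  # (image_path, question) -> answer


@dataclass
class WoodpeckerSketch:
    llm: TextFn            # text LLM used for extraction, formulation, correction
    vision_tool: VisionFn  # e.g. an open-vocabulary detector / VQA model

    def extract_key_concepts(self, response: str) -> List[str]:
        """Stage 1: pull out the main objects mentioned in the MLLM response."""
        out = self.llm(f"List the key objects mentioned in: {response}")
        return [c.strip() for c in out.split(",") if c.strip()]

    def formulate_questions(self, concepts: List[str]) -> List[str]:
        """Stage 2: turn each concept into verification questions."""
        return [f"Is there a {c} in the image? How many? Where?" for c in concepts]

    def validate_visual_knowledge(self, image: str, questions: List[str]) -> List[str]:
        """Stage 3: answer the questions against the image with a vision tool."""
        return [self.vision_tool(image, q) for q in questions]

    def generate_visual_claims(self, answers: List[str]) -> str:
        """Stage 4: condense the answers into a structured visual knowledge base."""
        return self.llm("Summarize these findings as factual claims: " + "; ".join(answers))

    def correct_hallucinations(self, response: str, claims: str) -> str:
        """Stage 5: rewrite the original response so it agrees with the claims."""
        return self.llm(
            f"Given the visual facts: {claims}\n"
            f"Correct any inconsistent statements in: {response}"
        )

    def run(self, image: str, response: str) -> str:
        """Chain the five stages into a single training-free correction pass."""
        concepts = self.extract_key_concepts(response)
        questions = self.formulate_questions(concepts)
        answers = self.validate_visual_knowledge(image, questions)
        claims = self.generate_visual_claims(answers)
        return self.correct_hallucinations(response, claims)


if __name__ == "__main__":
    # Stub helpers so the sketch runs stand-alone; a real pipeline would plug in
    # an LLM API call and an open-vocabulary detector / VQA model here.
    stub_llm: TextFn = lambda prompt: "a dog, a frisbee"
    stub_vision: VisionFn = lambda image, q: f"(answer about '{q}' from {image})"

    pipeline = WoodpeckerSketch(llm=stub_llm, vision_tool=stub_vision)
    print(pipeline.run("example.jpg", "A dog is catching a frisbee on the beach."))
```

Because every stage exposes its intermediate output, each step of the corrected answer can be traced back to an explicit visual claim, which is where the interpretability noted in the abstract comes from.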