Woodpecker: Hallucination Correction for Multimodal Large Language Models

Abstract

Hallucination, the phenomenon in which generated text is inconsistent with the image content, casts a long shadow over rapidly evolving Multimodal Large Language Models (MLLMs). To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we take a different path and introduce a training-free method named Woodpecker. Just as a woodpecker heals trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs, and it remains interpretable because the intermediate outputs of the five stages can be inspected. We evaluate Woodpecker both quantitatively and qualitatively and show the considerable potential of this new paradigm. On the POPE benchmark, our method improves accuracy by 30.66%/24.33% over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
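The five stages named in the abstract can be pictured as a chain of functions whose intermediate outputs are kept for inspection. The following is only a minimal sketch of that control flow, not the released implementation: the abstract does not say how each stage is realised, so every function here is a hypothetical, rule-based stand-in. In particular, `vocabulary` and `detected_objects` are assumed inputs that stand in for whatever language and vision models a real system would call, and the correction step simply drops unsupported sentences rather than rewriting them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Trace:
    """Intermediate outputs of the five stages, kept so each step can be inspected."""
    concepts: List[str] = field(default_factory=list)
    questions: List[str] = field(default_factory=list)
    answers: Dict[str, bool] = field(default_factory=dict)
    claims: List[str] = field(default_factory=list)
    corrected: str = ""


def _words(text: str) -> Set[str]:
    """Lower-cased words of a text with surrounding punctuation stripped."""
    return {w.strip(".,!?").lower() for w in text.split()}


def extract_key_concepts(text: str, vocabulary: List[str]) -> List[str]:
    # Stage 1 (toy stand-in): keep the vocabulary entries mentioned in the text.
    present = _words(text)
    return [c for c in vocabulary if c in present]


def formulate_questions(concepts: List[str]) -> List[str]:
    # Stage 2 (toy stand-in): one existence question per concept.
    return [f"Is there a {c} in the image?" for c in concepts]


def validate_visual_knowledge(concepts: List[str], detected_objects: Set[str]) -> Dict[str, bool]:
    # Stage 3 (toy stand-in): `detected_objects` is assumed to come from some visual model.
    return {c: c in detected_objects for c in concepts}


def generate_visual_claims(answers: Dict[str, bool]) -> List[str]:
    # Stage 4 (toy stand-in): turn each verified answer into a plain-text claim.
    return [f"There is {'a' if ok else 'no'} {c} in the image." for c, ok in answers.items()]


def correct_hallucinations(text: str, answers: Dict[str, bool]) -> str:
    # Stage 5 (toy stand-in): drop sentences that mention an unverified concept.
    unsupported = {c for c, ok in answers.items() if not ok}
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = [s for s in sentences if not (_words(s) & unsupported)]
    return ". ".join(kept) + "." if kept else ""


def woodpecker_sketch(text: str, vocabulary: List[str], detected_objects: Set[str]) -> Trace:
    """Run the five hypothetical stages in order and record every intermediate output."""
    trace = Trace()
    trace.concepts = extract_key_concepts(text, vocabulary)
    trace.questions = formulate_questions(trace.concepts)
    trace.answers = validate_visual_knowledge(trace.concepts, detected_objects)
    trace.claims = generate_visual_claims(trace.answers)
    trace.corrected = correct_hallucinations(text, trace.answers)
    return trace


if __name__ == "__main__":
    result = woodpecker_sketch(
        "A dog sits on the grass. A frisbee lies nearby.",
        vocabulary=["dog", "grass", "frisbee"],
        detected_objects={"dog", "grass"},
    )
    print(result.claims)
    print(result.corrected)
```

Because the sketch only post-processes text that has already been generated, it needs no access to the weights of the model that produced it, which mirrors the training-free, post-remedy framing of the abstract. The `Trace` object mirrors the interpretability claim in the same loose way: each stage's output can be read directly.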
