MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Abstract

Multimodal Large Language Models (MLLMs) frequently hallucinate, but the underlying reasons remain poorly understood. In this paper, we present an empirical analysis and find that, although MLLMs generate incorrect objects in their final output, they are in fact able to recognize visual objects in the preceding layers. We speculate that the strong knowledge priors of the language model suppress the visual information, leading to hallucinations. Motivated by this, we propose a novel dynamic correction decoding method for MLLMs (DeCo), which adaptively selects appropriate preceding layers and proportionally integrates their knowledge into the final layer to adjust the output logits. DeCo is model-agnostic: it can be combined seamlessly with various classic decoding strategies and applied to different MLLMs. We evaluate DeCo on widely used benchmarks and show that it reduces hallucination rates by a large margin compared to baselines, highlighting its potential to mitigate hallucinations. Code is available at https://github.com/zjunlp/DeCo.
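To make the idea of "selecting a preceding layer and proportionally integrating it into the final-layer logits" concrete, the sketch below shows one plausible reading on toy data. It is not the authors' implementation (see the repository above for that). The abstract does not specify the selection rule or the weighting, so everything here is an assumption made for illustration: the function name `corrected_logits`, the parameters `top_k` and `alpha`, the use of the final layer's top-k tokens as candidates, and the use of the selected layer's confidence as the blending weight.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def corrected_logits(final_logits, early_logits_by_layer, top_k=3, alpha=0.5):
    """Blend one preceding layer's logits into the final layer's logits.

    Assumptions (not taken from the abstract): candidates are the final
    layer's top-k tokens, the selected layer is the one most confident on
    any candidate, and that confidence scales the correction.
    """
    # Candidate token ids: the top-k tokens under the final layer.
    candidates = sorted(range(len(final_logits)),
                        key=lambda i: final_logits[i], reverse=True)[:top_k]

    # Select the preceding layer with the highest probability on a candidate.
    best_layer, best_conf = None, -1.0
    for layer, logits in early_logits_by_layer.items():
        probs = softmax(logits)
        conf = max(probs[i] for i in candidates)
        if conf > best_conf:
            best_layer, best_conf = layer, conf

    # Proportional integration: adjust candidate tokens only,
    # weighted by alpha and the selected layer's confidence.
    chosen = early_logits_by_layer[best_layer]
    out = list(final_logits)
    for i in candidates:
        out[i] = final_logits[i] + alpha * best_conf * chosen[i]
    return best_layer, out

# Toy example: token 0 is a hallucinated object, token 1 is the object
# that a preceding layer (hypothetically layer 20) still recognizes.
final_logits = [2.0, 1.8, 0.1, -1.0]
early = {20: [0.0, 3.0, 0.0, 0.0], 24: [0.5, 1.0, 0.2, 0.0]}
layer, out = corrected_logits(final_logits, early, top_k=2, alpha=0.5)
best_token = max(range(len(out)), key=lambda i: out[i])
```

In this toy case the final layer slightly prefers token 0, while the selected preceding layer strongly prefers token 1, so the adjusted logits favor token 1. Because the correction only modifies logits, a step like this could in principle sit in front of greedy, beam, or sampling-based decoding, which is consistent with the model-agnostic claim in the abstract.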
