Mitigating Object Hallucination via Concentric Causal Attention

Abstract

Recent Large Vision Language Models (LVLMs) show remarkable zero-shot conversational and reasoning capabilities given multimodal queries. Nevertheless, they suffer from object hallucination, a phenomenon in which LVLMs are prone to generating textual responses that are not factually aligned with the image inputs. Our pilot study reveals that object hallucination is closely tied to Rotary Position Encoding (RoPE), a positional dependency modeling design widely adopted in existing LVLMs. Due to the long-term decay in RoPE, LVLMs tend to hallucinate more when relevant visual cues are distant from instruction tokens in the multimodal input sequence. We observe a similar effect when reversing the sequential order of visual tokens during multimodal alignment. Our tests indicate that long-term decay in RoPE makes it difficult for LVLMs to capture visual-instruction interactions across long distances. We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy that mitigates the impact of RoPE long-term decay in LVLMs by naturally reducing the relative distance between visual and instruction tokens. With CCA, visual tokens can better interact with instruction tokens, thereby enhancing the model's perception capability and alleviating object hallucination. Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.
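
The long-term decay mentioned above can be made concrete with a small numerical sketch. The code below is not taken from the paper: it applies the standard RoPE rotation to all-ones query and key vectors and compares the average attention score at short and long relative distances. The head dimension (128), the base (10000), the all-ones vectors, and the two distance windows are illustrative assumptions, and the helper names `rope_rotate`, `rope_score`, and `mean_score` are hypothetical.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive pair of `vec` by pos * theta (standard RoPE)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # i is the even index of the pair, so this equals base^(-2j/d) with j = i/2.
        theta = base ** (-i / d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_score(q, k, q_pos, k_pos):
    """Unnormalized attention score between a rotated query and a rotated key."""
    q_rot = rope_rotate(q, q_pos)
    k_rot = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))

def mean_score(distances, dim=128):
    """Average score over a window of relative distances (assumed all-ones q and k)."""
    q = [1.0] * dim
    k = [1.0] * dim
    scores = [rope_score(q, k, dist, 0) for dist in distances]
    return sum(scores) / len(scores)

# Individual scores oscillate, so compare window averages instead of single points.
near = mean_score(range(1, 17))      # key close to the query
far = mean_score(range(512, 577))    # key several hundred positions away
print(f"mean score, distances 1-16:    {near:.1f}")
print(f"mean score, distances 512-576: {far:.1f}")
```

Under these assumptions, the averaged score for distant positions comes out lower than for nearby ones, which is the effect that a position assignment shortening the visual-to-instruction distance would aim to counteract. Real queries and keys are learned rather than constant, so this shows only the tendency, not the magnitudes seen in an actual LVLM.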
