CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

Abstract

Knowledge-based visual question answering (KB-VQA) shows significant potential for handling knowledge-intensive tasks. However, because the parametric knowledge that vision-language models (VLMs) acquire during pre-training is static, it can conflict with dynamically retrieved information. As a result, model outputs either ignore the retrieved contexts or integrate them inconsistently with parametric knowledge, which poses a substantial challenge for KB-VQA. Current knowledge conflict mitigation methods are primarily adapted from language-based approaches and focus on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA, a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, which features positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on the E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3% to 6.4% over existing methods. Code is available at https://github.com/cqu-student/CC-VQA.
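The abstract names "adaptive decoding using correlation-weighted conflict scoring" but gives no formulas. The sketch below is therefore only an illustration of the general idea, not the CC-VQA implementation: it combines the standard contrastive form of context-aware decoding, (1 + alpha) * logits_with_context - alpha * logits_without_context, with a hypothetical rule that sets alpha to the correlation-weighted mean of per-statement conflict scores. All function names, the weighting rule, and the toy numbers are assumptions introduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conflict_strength(correlations, conflicts):
    """Hypothetical rule: correlation-weighted mean of conflict scores.

    correlations[i]: how relevant retrieved statement i is to the question.
    conflicts[i]: how strongly statement i disagrees with the model's
    parametric answer. Both are assumed to lie in [0, 1].
    """
    total = sum(correlations)
    if total == 0:
        return 0.0
    return sum(w * c for w, c in zip(correlations, conflicts)) / total


def contrastive_logits(with_ctx, without_ctx, alpha):
    """Context-aware contrastive adjustment of next-token logits.

    A larger alpha pushes the distribution toward tokens favored by the
    retrieved context and away from the context-free (parametric) prior.
    """
    return [(1 + alpha) * a - alpha * b for a, b in zip(with_ctx, without_ctx)]


# Toy example: two retrieved statements, a three-token vocabulary.
correlations = [0.9, 0.1]      # assumed relevance of each statement
conflicts = [0.8, 0.2]         # assumed conflict of each statement
alpha = conflict_strength(correlations, conflicts)

with_ctx = [2.0, 1.0, 0.0]     # logits when the retrieved context is given
without_ctx = [1.0, 2.0, 0.0]  # logits from parametric knowledge alone
adjusted = contrastive_logits(with_ctx, without_ctx, alpha)
probs = softmax(adjusted)
```

In this toy setting the highly correlated statement dominates the conflict score, so the adjustment increases the probability of the token supported by the retrieved context relative to plain context-conditioned decoding. When the score is zero, the adjustment leaves the context-conditioned logits unchanged.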
