CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

Abstract

Knowledge-based visual question answering (KB-VQA) shows significant potential for knowledge-intensive tasks. However, because the parametric knowledge of a vision-language model (VLM) is fixed at pre-training, it can conflict with dynamically retrieved information. As a result, model outputs either ignore the retrieved contexts or integrate them inconsistently with parametric knowledge, which poses a substantial challenge for KB-VQA. Current knowledge-conflict mitigation methods are primarily adapted from language-based approaches and focus on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, both of which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA, a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, which applies positional encoding compression to low-correlation statements and performs adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on the E-VQA, InfoSeek, and OK-VQA benchmarks show that CC-VQA achieves state-of-the-art performance, with absolute accuracy improvements of 3.3% to 6.4% over existing methods. Code is available at https://github.com/cqu-student/CC-VQA.
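
To make the idea of context-aware decoding with a correlation-weighted conflict score more concrete, the following is a minimal Python sketch. It is not the CC-VQA implementation. It uses the generic contrastive form of context-aware decoding, in which next-token logits computed with the retrieved context are contrasted against logits computed without it. The function names, the product rule in `adaptive_alpha`, and the toy logits are all assumptions introduced here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_adjust(logits_with_ctx, logits_without_ctx, alpha):
    """Generic context-aware contrastive adjustment of next-token logits.

    alpha = 0 returns the context-conditioned logits unchanged; a larger
    alpha amplifies whatever the retrieved context changed relative to the
    parametric-only prediction.
    """
    return [
        (1.0 + alpha) * w - alpha * wo
        for w, wo in zip(logits_with_ctx, logits_without_ctx)
    ]


def adaptive_alpha(conflict, correlation, max_alpha=1.0):
    """Hypothetical weighting: scale a conflict score by context relevance.

    Both inputs are assumed to lie in [0, 1]. The product rule is an
    illustrative assumption, not the scoring used in the paper.
    """
    return max_alpha * conflict * correlation


# Toy vocabulary of three candidate answers (made-up numbers).
with_ctx = [2.0, 1.0, 0.1]     # logits with retrieved context in the prompt
without_ctx = [0.5, 2.0, 0.1]  # logits from parametric knowledge alone

alpha = adaptive_alpha(conflict=0.8, correlation=0.9)
adjusted = contrastive_adjust(with_ctx, without_ctx, alpha)
probs = softmax(adjusted)
```

In this sketch, a context that is both highly conflicting and highly correlated with the question pushes the output further toward the context-supported answer, while a weakly correlated context leaves the context-conditioned prediction almost unchanged.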
