Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Abstract

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or through prior familiarity with a chart based on their own background knowledge. To evaluate visual reasoning rigorously, we propose counterfactual charts, in which the chart-question task remains fixed while the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework that reverse-engineers charts into executable code, validates reconstruction fidelity, generates seed-controlled counterfactual variants, and derives new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring their sensitivity to variation and their generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize even after answering the original chart correctly. We find that failures are most prevalent when the updated charts require novel visual reasoning pathways.
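To make the idea of a seed-controlled counterfactual variant with an answer derived from executable QA logic concrete, the following is a minimal sketch and not the Chartographer implementation. The chart is reduced to a hypothetical data spec (the part a reconstructed plotting script would draw), and the names `ORIGINAL`, `make_variant`, `answer_largest`, and `build_items`, as well as the value range and the example question, are assumptions made for illustration only.

```python
import random

# Hypothetical chart spec: the data a reconstructed plotting script would render.
ORIGINAL = {"categories": ["A", "B", "C", "D"], "values": [12.0, 30.0, 18.0, 25.0]}


def make_variant(spec, seed):
    """Return a counterfactual spec whose values are resampled under a fixed seed.

    The categories (and therefore the question) stay fixed; only the data changes.
    The sampling range 5.0 to 50.0 is an arbitrary illustrative choice.
    """
    rng = random.Random(seed)  # local generator, so the same seed gives the same variant
    values = [round(rng.uniform(5.0, 50.0), 1) for _ in spec["values"]]
    return {"categories": list(spec["categories"]), "values": values}


def answer_largest(spec):
    """Executable QA logic for the fixed question:
    'Which category has the largest value?' (first category wins on ties)."""
    idx = max(range(len(spec["values"])), key=lambda i: spec["values"][i])
    return spec["categories"][idx]


def build_items(spec, seeds):
    """Pair the original chart and each seeded variant with its derived answer."""
    items = [{"seed": None, "spec": spec, "answer": answer_largest(spec)}]
    for seed in seeds:
        variant = make_variant(spec, seed)
        items.append({"seed": seed, "spec": variant, "answer": answer_largest(variant)})
    return items


if __name__ == "__main__":
    for item in build_items(ORIGINAL, seeds=[0, 1, 2]):
        print(item["seed"], item["spec"]["values"], item["answer"])
```

Because the answer is computed from the variant's data rather than copied from the original, a model that has memorized the original chart's answer gains nothing on the variants, and reusing a seed reproduces the same variant for repeated evaluation.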