Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Abstract

Chart question-answering (QA) benchmarks aim to pose questions that can only be answered correctly through visual reasoning, but models can often reach the answer through shortcuts or through prior familiarity with a chart drawn from their own background knowledge. To evaluate visual reasoning rigorously, we propose counterfactual charts, in which the chart-question task remains fixed while the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework that reverse engineers charts into executable code, validates reconstruction fidelity, generates seed-controlled counterfactual variants, and derives new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize even after answering the original chart correctly. Failures are most prevalent when the updated charts require novel visual reasoning pathways.
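
To make the idea of seed-controlled variants with answers derived from executable QA logic concrete, here is a minimal Python sketch. It is not the Chartographer implementation: the `BarChart` type, the `make_variant` and `answer_tallest_bar` functions, and the example data are all hypothetical, and the sketch assumes the original chart has already been recovered as structured data. It also assumes the original value range is wide enough to draw distinct values for every bar.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BarChart:
    """Hypothetical stand-in for a chart recovered as executable code."""
    title: str
    categories: tuple[str, ...]
    values: tuple[int, ...]


def make_variant(original: BarChart, seed: int) -> BarChart:
    """Resample the bar values deterministically from a seed (illustrative only)."""
    rng = random.Random(seed)
    lo, hi = min(original.values), max(original.values)
    # random.sample draws without replacement, so the tallest bar is unique.
    new_values = rng.sample(range(lo, hi + 1), k=len(original.values))
    return BarChart(original.title, original.categories, tuple(new_values))


def answer_tallest_bar(chart: BarChart) -> str:
    """Executable QA logic for 'Which category has the highest value?'."""
    best = max(range(len(chart.values)), key=chart.values.__getitem__)
    return chart.categories[best]


# Made-up example data; the question stays fixed while chart and answer vary.
original = BarChart("Units sold", ("A", "B", "C", "D"), (12, 30, 18, 25))
variants = [make_variant(original, seed) for seed in range(3)]
answers = [answer_tallest_bar(v) for v in variants]
```

Because the answer is recomputed from each variant's data rather than copied from the original, the ground truth stays consistent with whatever the variant shows, and reusing a seed reproduces the same variant.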