Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Abstract

Chart question-answering (QA) benchmarks aim to pose questions that can only be answered correctly through visual reasoning, but models can often reach the answer through shortcuts or through prior familiarity with a chart drawn from their own background knowledge. To rigorously evaluate visual reasoning, we propose counterfactual charts, in which the chart-question task remains fixed while the underlying chart and its corresponding answer are varied. We introduce Chartographer, a framework that reverse engineers charts into executable code, validates reconstruction fidelity, generates seed-controlled counterfactual variants, and derives new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring their variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find that failures are most prevalent when the updated charts require novel visual reasoning pathways.
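To make the idea of a seed-controlled counterfactual variant concrete, the following is a minimal sketch and not the Chartographer implementation. It assumes a chart has already been reduced to a small data specification; the names `ChartSpec`, `make_variant`, and `answer_largest_category`, the example question, and the sampling rule are all hypothetical and chosen only for illustration. The question stays fixed, the data changes with the seed, and the new answer is recomputed by code rather than annotated by hand.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical stand-in for a chart recovered as executable code."""
    categories: tuple[str, ...]
    values: tuple[float, ...]


def make_variant(spec: ChartSpec, seed: int) -> ChartSpec:
    """Return a counterfactual chart: same categories, new seed-controlled values.

    Assumption: new values are drawn uniformly from the original value range.
    """
    rng = random.Random(seed)  # local generator, so the same seed gives the same variant
    lo, hi = min(spec.values), max(spec.values)
    new_values = tuple(round(rng.uniform(lo, hi), 1) for _ in spec.values)
    return ChartSpec(spec.categories, new_values)


def answer_largest_category(spec: ChartSpec) -> str:
    """Executable QA logic for the fixed question
    'Which category has the largest value?'"""
    return max(zip(spec.categories, spec.values), key=lambda pair: pair[1])[0]


# Illustrative data only; not taken from any dataset.
original = ChartSpec(("A", "B", "C"), (12.0, 30.0, 18.0))
variants = [make_variant(original, seed) for seed in range(5)]

# Each variant carries its own answer, derived from the same QA function.
answers = [answer_largest_category(v) for v in variants]
```

Because the answer is computed from the variant's data, a model that memorized the original chart's answer gains nothing from it whenever the recomputed answer differs.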