Chartographer: Counterfactual Chart Generation for Vision-Language Model Evaluation

Abstract

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach the answer through shortcuts or through prior familiarity with a chart drawn from their own background knowledge. To rigorously evaluate visual reasoning, we propose counterfactual charts, in which the chart-question task remains fixed but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework that reverse engineers charts into executable code, validates reconstruction fidelity, generates seed-controlled counterfactual variants, and derives new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring their sensitivity to variation and their generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find that failures are most prevalent when the updated charts require novel visual reasoning pathways.
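The idea of seed-controlled variants with answers derived from executable QA logic can be illustrated with a minimal sketch. This is not the Chartographer implementation: the data layout and the names `ORIGINAL`, `make_variant`, and `answer_max_category` are hypothetical, and the perturbation (shuffling and rescaling values) is only one assumed way to produce a counterfactual. The sketch works on the chart's underlying data and omits rendering and fidelity validation.

```python
import random

# Hypothetical chart data: the values that reconstructed plotting code would draw.
ORIGINAL = {"A": 12.0, "B": 30.0, "C": 21.0}

def make_variant(data, seed):
    """Return a counterfactual copy of the data, reproducible for a given seed."""
    rng = random.Random(seed)      # local RNG, so the seed fully determines the variant
    values = list(data.values())
    rng.shuffle(values)            # reassign values to categories
    # Rescale each value so that memorized numbers no longer match (assumed perturbation).
    return {k: round(v * rng.uniform(0.5, 1.5), 1) for k, v in zip(data, values)}

def answer_max_category(data):
    """Executable QA logic for the fixed question 'Which category has the largest value?'."""
    return max(data, key=data.get)

# The question stays fixed; the chart data and the derived answer change per seed.
variants = {seed: make_variant(ORIGINAL, seed) for seed in range(3)}
answers = {seed: answer_max_category(v) for seed, v in variants.items()}
```

Because the answer is recomputed from each variant's data rather than copied from the original, a model that has memorized the original chart gains nothing from that memory on the variants.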