Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
September 11, 2025
Authors: Bohao Tang, Yan Ma, Fei Zhang, Jiadi Su, Ethan Chern, Zhulin Hu, Zhixin Wang, Pengfei Liu, Ya Zhang
cs.AI
Abstract
Chart understanding presents a critical test of the reasoning capabilities of
Vision-Language Models (VLMs). Prior approaches face significant limitations: some
rely on external tools, making them brittle and constrained by a predefined
toolkit, while others fine-tune specialist models that often adopt a single
reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate
steps of text-based reasoning are difficult to verify, which complicates the
use of reinforcement-learning signals that reward factual accuracy. To address
this, we propose a Code-as-Thought (CaT) approach to represent the visual
information of a chart in a verifiable, symbolic format. Our key insight is
that this strategy must be adaptive: a fixed, code-only implementation
consistently fails on complex charts where symbolic representation is
unsuitable. This finding leads us to introduce Visual Programmability: a
learnable property that determines whether a chart-question pair is better solved
with code or direct visual analysis. We implement this concept in an adaptive
framework where a VLM learns to choose between the CaT pathway and a direct
visual reasoning pathway. The model's selection policy is trained with
reinforcement learning using a novel dual-reward system. This system combines a
data-accuracy reward to ground the model in facts and prevent numerical
hallucination, with a decision reward that teaches the model when to use each
strategy, preventing it from defaulting to a single reasoning mode. Experiments
demonstrate strong and robust performance across diverse chart-understanding
benchmarks. Our work shows that VLMs can be taught not only how to reason but
also to dynamically select the optimal reasoning pathway for each task.
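
The abstract does not include code, but the Code-as-Thought idea is easy to illustrate. The sketch below is hypothetical (the chart, variable names, and values are invented for illustration): the model transcribes a chart into a small program whose intermediate values can be checked by execution, instead of reasoning over the numbers in free-form text.

```python
# Hypothetical Code-as-Thought (CaT) trace for a simple bar chart.
# Instead of free-text reasoning, the model emits a program whose intermediate
# values are verifiable by execution; all names and numbers here are invented.

# Step 1: transcribe the chart's visual content into a symbolic structure.
revenue_by_year = {"2021": 42.0, "2022": 55.5, "2023": 61.0}  # units: $M (made up)

# Step 2: answer the question ("How much did revenue grow from 2021 to 2023?")
# with plain, checkable arithmetic.
growth = revenue_by_year["2023"] - revenue_by_year["2021"]
print(f"Revenue grew by ${growth:.1f}M")  # -> Revenue grew by $19.0M
```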
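
The dual-reward signal can be sketched in the same spirit. The function below is a toy reconstruction, not the paper's implementation: the component names, weights, and matching tolerance are assumptions, chosen only to show how a data-accuracy term and a decision term could be combined into a single scalar reward for reinforcement learning.

```python
def dual_reward(pred_answer, gold_answer,
                extracted_data, gold_data,
                chose_code, code_was_appropriate,
                w_answer=1.0, w_data=0.5, w_decision=0.5):
    """Toy composite reward (hypothetical): answer accuracy + data fidelity + routing."""
    # Final-answer correctness.
    r_answer = 1.0 if pred_answer == gold_answer else 0.0

    # Data-accuracy term: fraction of ground-truth chart values that the
    # generated code reproduced, grounding the model in the chart's actual
    # numbers and discouraging numerical hallucination.
    if gold_data:
        matched = sum(1 for key, value in gold_data.items()
                      if key in extracted_data
                      and abs(extracted_data[key] - value) < 1e-2)
        r_data = matched / len(gold_data)
    else:
        r_data = 0.0

    # Decision term: credit for choosing the pathway (CaT vs. direct visual
    # reasoning) that suits this chart-question pair, so the policy does not
    # collapse onto a single reasoning mode.
    r_decision = 1.0 if chose_code == code_was_appropriate else 0.0

    return w_answer * r_answer + w_data * r_data + w_decision * r_decision
```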