Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
September 11, 2025
Authors: Bohao Tang, Yan Ma, Fei Zhang, Jiadi Su, Ethan Chern, Zhulin Hu, Zhixin Wang, Pengfei Liu, Ya Zhang
cs.AI
Abstract
Chart understanding presents a critical test of the reasoning capabilities of
Vision-Language Models (VLMs). Prior approaches face significant limitations: some
rely on external tools, making them brittle and constrained by a predefined
toolkit, while others fine-tune specialist models that often adopt a single
reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate
steps of text-based reasoning are difficult to verify, which complicates the
use of reinforcement-learning signals that reward factual accuracy. To address
this, we propose a Code-as-Thought (CaT) approach to represent the visual
information of a chart in a verifiable, symbolic format. Our key insight is
that this strategy must be adaptive: a fixed, code-only implementation
consistently fails on complex charts where symbolic representation is
unsuitable. This finding leads us to introduce Visual Programmability: a
learnable property that determines whether a chart-question pair is better solved
with code or direct visual analysis. We implement this concept in an adaptive
framework where a VLM learns to choose between the CaT pathway and a direct
visual reasoning pathway. The model's selection policy is trained with
reinforcement learning using a novel dual-reward system. This system combines a
data-accuracy reward to ground the model in facts and prevent numerical
hallucination, with a decision reward that teaches the model when to use each
strategy, preventing it from defaulting to a single reasoning mode. Experiments
demonstrate strong and robust performance across diverse chart-understanding
benchmarks. Our work shows that VLMs can be taught not only how to reason but
also to dynamically select the optimal reasoning pathway for each task.
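
The abstract does not include code, but the Code-as-Thought idea is easy to illustrate. The sketch below is hypothetical (the chart, variable names, and values are invented for illustration): the model transcribes a chart into a small program whose intermediate values can be checked by execution, instead of reasoning over the numbers in free-form text.

```python
# Hypothetical Code-as-Thought (CaT) trace for a simple bar chart.
# Instead of free-text reasoning, the model emits a program whose intermediate
# values are verifiable by execution; all names and numbers here are invented.

# Step 1: transcribe the chart's visual content into a symbolic structure.
revenue_by_year = {"2021": 42.0, "2022": 55.5, "2023": 61.0}  # units: $M (made up)

# Step 2: answer the question ("How much did revenue grow from 2021 to 2023?")
# with plain, checkable arithmetic.
growth = revenue_by_year["2023"] - revenue_by_year["2021"]
print(f"Revenue grew by ${growth:.1f}M")  # -> Revenue grew by $19.0M
```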
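
The dual-reward signal can be sketched in the same spirit. The function below is a toy reconstruction, not the paper's implementation: the component names, weights, and matching tolerance are assumptions, chosen only to show how a data-accuracy term and a decision term could be combined into a single scalar reward for reinforcement learning.

```python
def dual_reward(pred_answer, gold_answer,
                extracted_data, gold_data,
                chose_code, code_was_appropriate,
                w_answer=1.0, w_data=0.5, w_decision=0.5):
    """Toy composite reward (hypothetical): answer accuracy + data fidelity + routing."""
    # Final-answer correctness.
    r_answer = 1.0 if pred_answer == gold_answer else 0.0

    # Data-accuracy term: fraction of ground-truth chart values that the
    # generated code reproduced, grounding the model in the chart's actual
    # numbers and discouraging numerical hallucination.
    if gold_data:
        matched = sum(1 for key, value in gold_data.items()
                      if key in extracted_data
                      and abs(extracted_data[key] - value) < 1e-2)
        r_data = matched / len(gold_data)
    else:
        r_data = 0.0

    # Decision term: credit for choosing the pathway (CaT vs. direct visual
    # reasoning) that suits this chart-question pair, so the policy does not
    # collapse onto a single reasoning mode.
    r_decision = 1.0 if chose_code == code_was_appropriate else 0.0

    return w_answer * r_answer + w_data * r_data + w_decision * r_decision
```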