Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
September 11, 2025
Authors: Bohao Tang, Yan Ma, Fei Zhang, Jiadi Su, Ethan Chern, Zhulin Hu, Zhixin Wang, Pengfei Liu, Ya Zhang
cs.AI
Abstract
Chart understanding poses a critical test of the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches suffer from key limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that typically adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach that represents the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines whether a chart-question pair is better solved with code or with direct visual analysis. We implement this concept in an adaptive framework in which a VLM learns to choose between the CaT pathway and a direct visual-reasoning pathway. The model's selection policy is trained with reinforcement learning using a novel dual-reward system, which combines a data-accuracy reward, grounding the model in facts and preventing numerical hallucination, with a decision reward that teaches the model when to use each strategy and keeps it from defaulting to a single reasoning mode. Experiments demonstrate strong, robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only how to reason but also which way to reason, dynamically selecting the optimal reasoning pathway for each task.
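
To make the abstract's dual-reward idea concrete, the sketch below shows one way such a reward could be wired up in Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the pathway labels ("cat"/"visual"), the function names (run_cat_pathway, data_accuracy_reward, decision_reward, total_reward), the reward weights, and the tolerance for matching extracted values are all illustrative choices, since the abstract does not specify the actual reward shaping.

```python
from dataclasses import dataclass


def run_cat_pathway(generated_code: str) -> str:
    """Execute model-emitted code that reconstructs the chart's data and
    computes the answer. Execution is what makes each intermediate step
    verifiable; a real system would sandbox this, not call exec() directly."""
    scope: dict = {}
    exec(generated_code, scope)  # hypothetical convention: code sets `answer`
    return str(scope.get("answer"))


@dataclass
class Rollout:
    pathway: str     # "cat" (Code-as-Thought) or "visual" (direct reasoning)
    answer: str      # final answer from the chosen pathway
    extracted: dict  # numeric values the model read off the chart, if any


def data_accuracy_reward(extracted: dict, gold: dict) -> float:
    """Grounding term: fraction of gold chart values the model extracted
    within a small relative tolerance. Penalizing mismatches discourages
    numerical hallucination."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for key, value in gold.items()
        if key in extracted
        and abs(extracted[key] - value) <= 1e-2 * max(abs(value), 1.0)
    )
    return hits / len(gold)


def decision_reward(pathway: str, better_pathway: str) -> float:
    """Decision term: reward choosing the pathway suited to this
    chart-question pair (its 'visual programmability'), so the policy does
    not collapse to a single reasoning mode."""
    return 1.0 if pathway == better_pathway else 0.0


def total_reward(
    rollout: Rollout,
    gold_values: dict,
    gold_answer: str,
    better_pathway: str,
    w_data: float = 0.5,      # assumed weights; the paper's are unknown
    w_decision: float = 0.5,
) -> float:
    """Combined RL signal: answer correctness plus the two abstract-described
    terms (data accuracy and decision quality)."""
    correct = 1.0 if rollout.answer == gold_answer else 0.0
    return (
        correct
        + w_data * data_accuracy_reward(rollout.extracted, gold_values)
        + w_decision * decision_reward(rollout.pathway, better_pathway)
    )


# Toy usage: a CaT rollout that read the 2021 bar as 41.8 (gold: 42.0).
rollout = Rollout(pathway="cat", answer="42", extracted={"2021": 41.8})
print(total_reward(rollout, gold_values={"2021": 42.0},
                   gold_answer="42", better_pathway="cat"))  # -> 2.0
```

One design point worth noting: splitting the signal this way means a rollout can be rewarded for reading the chart correctly even when its final answer is wrong, and for routing correctly even when extraction fails, which matches the abstract's claim that the decision reward prevents the policy from defaulting to one mode.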