Visual Programmability: A Guide for Code-as-Thought in Chart Understanding

September 11, 2025
Authors: Bohao Tang, Yan Ma, Fei Zhang, Jiadi Su, Ethan Chern, Zhulin Hu, Zhixin Wang, Pengfei Liu, Ya Zhang
cs.AI

Abstract
Chart understanding presents a critical test of the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face key limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach that represents the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines whether a chart-question pair is better solved with code or with direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The model's selection policy is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward, which grounds the model in facts and prevents numerical hallucination, with a decision reward, which teaches the model when to use each strategy and prevents it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.