

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

December 3, 2025
Authors: Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin
cs.AI

Abstract

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
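The code-as-tool idea described above can be sketched as a small execution loop: instead of selecting from a fixed tool registry, the model emits a code snippet that is run against the current image, and any runtime error is returned as feedback for the next turn. The sketch below is a hypothetical illustration under assumed names (`run_tool_code`, a toy list-based "image"), not the authors' implementation from the CodeVision repository.

```python
# Minimal sketch of a code-as-tool loop (hypothetical, not the paper's code).
# The "image" here is a toy 2D grid; the model would emit arbitrary Python
# that transforms it, and the runtime feeds back the result or the error.

def run_tool_code(code: str, image):
    """Execute model-emitted code in a tiny sandbox exposing `image`.

    Returns (new_image, error). On failure the error text is returned so
    the model can attempt recovery in a later turn.
    """
    scope = {"image": image}
    allowed = {"range": range, "len": len, "list": list,
               "zip": zip, "reversed": reversed}
    try:
        exec(code, {"__builtins__": allowed}, scope)
        return scope.get("image"), None
    except Exception as e:
        return image, f"{type(e).__name__}: {e}"

# A 2x3 toy image the model wants to rotate 90 degrees clockwise,
# mirroring the orientation-change setting the paper evaluates.
img = [[1, 2, 3],
       [4, 5, 6]]

# Code the model might emit: rotate by reversing rows, then transposing.
snippet = "image = [list(row) for row in zip(*image[::-1])]"
rotated, err = run_tool_code(snippet, img)
assert err is None
assert rotated == [[4, 1], [5, 2], [6, 3]]

# A buggy snippet yields an error message instead of crashing the loop,
# giving the model the runtime feedback needed for error recovery.
_, err = run_tool_code("image = image.rotate(90)", img)
assert "AttributeError" in err
```

In a real system the sandbox would expose an actual image library rather than a list of lists, but the shape of the loop is the same: generated code as the universal interface, with execution results and errors closing the feedback loop.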