PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
September 29, 2025
Authors: Shuoshuo Zhang, Zijian Li, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Yujiu Yang, Rui Wang
cs.AI
Abstract
Structured images (e.g., charts and geometric diagrams) remain challenging
for multimodal large language models (MLLMs), as perceptual slips can cascade
into erroneous conclusions. Intermediate visual cues can steer reasoning;
however, existing cue-based methods are constrained by low-fidelity image
processing and linear, rigid reasoning patterns, limiting their effectiveness
on complex structured-image tasks. In this paper, we propose PixelCraft, a
novel multi-agent system for high-fidelity image processing and flexible visual
reasoning on structured images. The system comprises a dispatcher, a planner, a
reasoner, critics, and a set of visual tool agents. To achieve high-fidelity
processing, we construct a high-quality corpus and fine-tune an MLLM into a
grounding model, whose pixel-level localizations are integrated with
traditional computer vision (CV) algorithms in tool agents. Building on this
foundation, PixelCraft facilitates flexible visual reasoning through a dynamic
three-stage workflow of tool selection, agent discussion, and self-criticism.
Moreover, unlike prior linear reasoning patterns that simply append historical
images, PixelCraft maintains an image memory to allow the planner to adaptively
revisit earlier visual steps, explore alternative reasoning branches, and
dynamically adjust the reasoning trajectory during discussion. Extensive
experiments on challenging chart and geometry benchmarks demonstrate that
PixelCraft significantly improves visual reasoning performance for advanced
MLLMs, setting a new standard for structured image reasoning. Our code will be
available at https://github.com/microsoft/PixelCraft.
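
To make the contrast between an appended image history and an image memory concrete, the following is a minimal sketch only, not the authors' implementation. The abstract does not describe the memory's data structure, so every name here (`MemoryEntry`, `ImageMemory`, `add`, `lineage`, `branches`) and the choice of a parent-linked store are assumptions made for illustration. The idea shown is that each stored image records which earlier image it was derived from, so a planner could return to any earlier visual step and start a different branch instead of only extending a linear list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MemoryEntry:
    """One stored image and the step that produced it (hypothetical schema)."""
    image_id: str
    description: str
    parent_id: Optional[str] = None  # id of the image this one was derived from


@dataclass
class ImageMemory:
    """Hypothetical image memory: entries are addressable by id, not just appended."""
    entries: Dict[str, MemoryEntry] = field(default_factory=dict)

    def add(self, image_id: str, description: str,
            parent_id: Optional[str] = None) -> MemoryEntry:
        """Store a new image, optionally linked to the image it was derived from."""
        if parent_id is not None and parent_id not in self.entries:
            raise KeyError(f"unknown parent image: {parent_id}")
        entry = MemoryEntry(image_id, description, parent_id)
        self.entries[image_id] = entry
        return entry

    def lineage(self, image_id: str) -> List[str]:
        """Return the ids from the root image down to image_id."""
        chain: List[str] = []
        current: Optional[str] = image_id
        while current is not None:
            chain.append(current)
            current = self.entries[current].parent_id
        return list(reversed(chain))

    def branches(self, image_id: str) -> List[str]:
        """Return ids of images derived directly from image_id, in insertion order."""
        return [e.image_id for e in self.entries.values()
                if e.parent_id == image_id]


# Illustrative usage with made-up step descriptions.
memory = ImageMemory()
memory.add("img0", "original chart")
memory.add("img1", "cropped legend", parent_id="img0")
memory.add("img2", "isolated line for series A", parent_id="img0")
memory.add("img3", "zoomed view of series A", parent_id="img2")

print(memory.lineage("img3"))   # ['img0', 'img2', 'img3']
print(memory.branches("img0"))  # ['img1', 'img2']
```

In this sketch, revisiting an earlier step amounts to calling `add` with an older `parent_id`, which opens a new branch without discarding the existing ones. A plain appended history has no such link, so it cannot express which image a later step was built on.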