PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Abstract

Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), because perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and by linear, rigid reasoning patterns, which limits their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in the tool agents. Building on this foundation, PixelCraft supports flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory that lets the planner adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves the visual reasoning performance of advanced MLLMs, setting a new standard for structured-image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.
