PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Abstract

Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and linear, rigid reasoning patterns, which limits their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in the tool agents. Building on this foundation, PixelCraft supports flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory that lets the planner adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.
