PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Abstract

Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and linear, rigid reasoning patterns, which limits their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in the tool agents. Building on this foundation, PixelCraft supports flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory that allows the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.
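To make the contrast between appending historical images and keeping an image memory concrete, the following is a minimal illustrative sketch and not the PixelCraft implementation. It assumes a tree-shaped store in which every intermediate image records the image it was derived from, so that a planner could return to an earlier image and start a different branch. The class names (`ImageRecord`, `ImageMemory`), the methods, and the tool names (`crop`, `zoom`) are all hypothetical and do not come from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ImageRecord:
    """One intermediate image produced by a (hypothetical) tool agent."""
    image_id: str
    tool: str                  # hypothetical name of the tool that produced it
    parent_id: Optional[str]   # image it was derived from; None for the input


@dataclass
class ImageMemory:
    """Tree-shaped store of intermediate images, keyed by image id."""
    records: Dict[str, ImageRecord] = field(default_factory=dict)

    def add(self, image_id: str, tool: str,
            parent_id: Optional[str] = None) -> ImageRecord:
        # Reject duplicates and dangling parents so the tree stays consistent.
        if image_id in self.records:
            raise ValueError(f"duplicate image id: {image_id}")
        if parent_id is not None and parent_id not in self.records:
            raise KeyError(f"unknown parent image: {parent_id}")
        record = ImageRecord(image_id, tool, parent_id)
        self.records[image_id] = record
        return record

    def lineage(self, image_id: str) -> List[str]:
        """Return the ids from the original input down to image_id."""
        path: List[str] = []
        current: Optional[str] = image_id
        while current is not None:
            path.append(current)
            current = self.records[current].parent_id
        return path[::-1]

    def children(self, image_id: str) -> List[str]:
        """Return the ids of images derived directly from image_id."""
        return [r.image_id for r in self.records.values()
                if r.parent_id == image_id]


# Usage: one branch inspects the legend, then the planner revisits the
# original input and opens a second branch instead of extending the first.
memory = ImageMemory()
memory.add("input", tool="none")
memory.add("crop_legend", tool="crop", parent_id="input")
memory.add("zoom_legend", tool="zoom", parent_id="crop_legend")
memory.add("crop_axis", tool="crop", parent_id="input")

print(memory.lineage("zoom_legend"))
print(memory.children("input"))
```

In a purely linear scheme, the history would be a single list and `crop_axis` could only be appended after `zoom_legend`. Under the assumptions of this sketch, the parent links let each step be attached to whichever earlier image is most useful.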
