ChartM^3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

Abstract

Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of the complex chart scenarios and computation-intensive reasoning tasks that are prevalent in real-world applications. To address these limitations, this study proposes an automated, multi-stage, code-driven pipeline for systematically generating visual reasoning datasets. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning code that simulates real data distributions; this code in turn drives chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM^3, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
