MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Abstract

Chain-of-Thought (CoT) has substantially improved mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either apply similar text-only reasoning to image inputs or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained, box-shaped image regions; limited perception of math content by vision encoders; and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, which introduces Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To support this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems that align each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy that progressively combines text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, yielding our MINT-CoT-7B model. Extensive experiments demonstrate that our method enables effective visual interleaved reasoning in mathematical domains: MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT
