CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Abstract

When Multimodal Large Language Models (MLLMs) fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is the failure due to perceptual deficiencies or to reasoning limitations? Through a systematic scaling analysis that scales the perception and reasoning components independently, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true bottleneck in current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium: executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work, which relies on problem-solving accuracy as a proxy and therefore measures only problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing a deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.
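
To make the code-grounded captioning idea concrete, the following is a minimal illustrative sketch, not the authors' pipeline. It assumes a figure is defined by a small Python program, and it derives the caption only from values that exist after that program runs, so the caption cannot mention anything the figure does not contain. The names `ICCTriplet`, `FIGURE_CODE`, and `caption_from_code` are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
import math


@dataclass
class ICCTriplet:
    """Hypothetical container for one image-caption-code sample."""
    image_path: str  # path of the rendered figure (rendering is omitted here)
    caption: str     # caption derived from the executed code
    code: str        # program that defines the figure


# Hypothetical figure program: a triangle given by its labelled vertices.
FIGURE_CODE = """
vertices = {"A": (0, 0), "B": (4, 0), "C": (0, 3)}
"""


def caption_from_code(code: str) -> str:
    """Run the figure program and describe only facts present in its state."""
    scope = {}
    exec(code, scope)  # assumption: the code is trusted; never exec untrusted input
    pts = scope["vertices"]
    names = sorted(pts)
    sides = []
    for i, p in enumerate(names):
        q = names[(i + 1) % len(names)]  # next vertex, wrapping around
        sides.append(f"{p}{q} = {math.dist(pts[p], pts[q]):g}")
    return f"Triangle {''.join(names)} with side lengths " + ", ".join(sides) + "."


triplet = ICCTriplet("triangle.png", caption_from_code(FIGURE_CODE), FIGURE_CODE)
print(triplet.caption)
```

In this toy setting every number in the caption is computed from the program state, which is the sense in which the code acts as ground truth for the caption. A real pipeline would also render the image and would have to sandbox the code before executing it.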
