

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

March 11, 2026
Authors: Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang
cs.AI

Abstract

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is the failure due to perceptual deficiencies or reasoning limitations? Through a systematic scaling analysis that scales the perception and reasoning components independently, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true bottleneck limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium: executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset of 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as the ground truth for image captions, eliminating the hallucinations inherent in existing knowledge-distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work that relies on problem-solving accuracy, a proxy that measures only problem-relevant understanding, our benchmark requires comprehensive visual comprehension: the model must generate executable code that reconstructs the image, providing a deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.
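The reconstruction-based evaluation idea can be sketched in miniature. The snippet below is a hypothetical illustration, not the authors' actual STEM2Code-Eval pipeline: it executes model-generated drawing code and checks whether its rendered output exactly matches the output of the reference code, giving a deterministic pass/fail signal. All names (`run_reconstruction`, `render`, the toy ASCII "figures") are invented for this sketch; a real pipeline would sandbox execution and compare rendered images with a similarity metric rather than string equality.

```python
# Hypothetical sketch of deterministic, reconstruction-based evaluation:
# execute drawing code in a fresh namespace and compare rendered outputs.

def run_reconstruction(code: str) -> str:
    """Execute drawing code that defines render() returning a string canvas."""
    namespace: dict = {}
    exec(code, namespace)  # a real evaluator would sandbox this call
    return namespace["render"]()

# Ground-truth code for a toy "figure": a 3x3 grid with a marked diagonal.
REFERENCE_CODE = """
def render():
    rows = []
    for y in range(3):
        rows.append("".join("#" if x == y else "." for x in range(3)))
    return "\\n".join(rows)
"""

# Model-generated reconstruction code (different style, same figure).
GENERATED_CODE = """
def render():
    return "\\n".join(
        "".join("#" if x == y else "." for x in range(3)) for y in range(3)
    )
"""

def reconstruction_match(reference: str, generated: str) -> bool:
    # Deterministic check: identical rendered output counts as a faithful
    # reconstruction; image pipelines would use pixel/structural similarity.
    return run_reconstruction(reference) == run_reconstruction(generated)

print(reconstruction_match(REFERENCE_CODE, GENERATED_CODE))  # True
```

Because the verdict depends only on executing the generated code, the score is reproducible and verifiable, in contrast to proxy metrics based on downstream problem-solving accuracy.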