Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
September 7, 2025
Authors: Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong
cs.AI
Abstract
Multimodal reasoning remains a fundamental challenge in artificial
intelligence. Despite substantial advances in text-based reasoning, even
state-of-the-art models such as GPT-o3 struggle to maintain strong performance
in multimodal scenarios. To address this gap, we introduce a caption-assisted
reasoning framework that effectively bridges visual and textual modalities. Our
approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge
2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we
validate its generalization on the MathVerse benchmark for geometric reasoning,
demonstrating the versatility of our method. Our code is publicly available at
https://github.com/OpenDCAI/SciReasoner.
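The abstract names a caption-assisted reasoning framework but does not detail its pipeline. The following is a minimal sketch of the general idea as we read it from the abstract: an image is first converted into a textual caption, which is then combined with the question and passed to a text-only reasoner. The function names and stubs here are hypothetical illustrations, not the authors' implementation (see the linked repository for that).

```python
# Hedged sketch of a caption-assisted reasoning pipeline. Both model calls
# below are hypothetical stubs standing in for real vision-language and
# text-reasoning models; they are NOT the authors' actual components.

def caption_image(image_description: str) -> str:
    # Stand-in for a vision-language captioner that turns a figure
    # into a detailed textual description.
    return f"Figure caption: {image_description}"

def reason_over_text(caption: str, question: str) -> str:
    # Stand-in for a strong text-only reasoning model that solves the
    # problem from the caption plus the question, never seeing pixels.
    return f"Answer derived from [{caption}] and question [{question}]"

def caption_assisted_reasoning(image_description: str, question: str) -> str:
    # The bridge: visual content enters the reasoner only as text,
    # so a text-strong model can be applied to a multimodal problem.
    caption = caption_image(image_description)
    return reason_over_text(caption, question)
```

The design choice this sketch illustrates is the one the abstract claims: decoupling perception (captioning) from reasoning lets a model that is strong at text-based reasoning handle multimodal physics problems.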