Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
September 7, 2025
Authors: Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong
cs.AI
Abstract
Multimodal reasoning remains a fundamental challenge in artificial
intelligence. Despite substantial advances in text-based reasoning, even
state-of-the-art models such as GPT-o3 struggle to maintain strong performance
in multimodal scenarios. To address this gap, we introduce a caption-assisted
reasoning framework that effectively bridges visual and textual modalities. Our
approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge
2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we
validate its generalization on the MathVerse benchmark for geometric reasoning,
demonstrating the versatility of our method. Our code is publicly available at
https://github.com/OpenDCAI/SciReasoner.
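The abstract names a caption-assisted reasoning framework but does not detail its pipeline. The following is a minimal sketch of the general idea as we read it from the abstract: an image is first converted into a textual caption, which is then combined with the question and passed to a text-only reasoner. The function names and stubs here are hypothetical illustrations, not the authors' implementation (see the linked repository for that).

```python
# Hedged sketch of a caption-assisted reasoning pipeline. Both model calls
# below are hypothetical stubs standing in for real vision-language and
# text-reasoning models; they are NOT the authors' actual components.

def caption_image(image_description: str) -> str:
    # Stand-in for a vision-language captioner that turns a figure
    # into a detailed textual description.
    return f"Figure caption: {image_description}"

def reason_over_text(caption: str, question: str) -> str:
    # Stand-in for a strong text-only reasoning model that solves the
    # problem from the caption plus the question, never seeing pixels.
    return f"Answer derived from [{caption}] and question [{question}]"

def caption_assisted_reasoning(image_description: str, question: str) -> str:
    # The bridge: visual content enters the reasoner only as text,
    # so a text-strong model can be applied to a multimodal problem.
    caption = caption_image(image_description)
    return reason_over_text(caption, question)
```

The design choice this sketch illustrates is the one the abstract claims: decoupling perception (captioning) from reasoning lets a model that is strong at text-based reasoning handle multimodal physics problems.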