

Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

September 7, 2025
Authors: Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong
cs.AI

Abstract

Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.
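The abstract describes a two-stage idea: first convert the visual input into a textual caption, then let a text-only reasoning model answer over the caption plus the question. A minimal sketch of that pipeline is below; every function name, prompt, and return value here is a hypothetical placeholder standing in for real model calls, not the authors' released implementation.

```python
# Sketch of a caption-assisted reasoning pipeline (illustrative only).
# In a real system, both stages would be backed by model APIs; here they
# are deterministic stubs so the control flow is runnable as-is.

def generate_caption(image_description: str) -> str:
    # Stage 1 (hypothetical): a vision-language model would produce a
    # detailed textual caption of the figure here.
    return f"Caption: {image_description}"

def reason_over_text(question: str, caption: str) -> str:
    # Stage 2 (hypothetical): a strong text-only reasoning model would
    # answer the question using the caption as its view of the image.
    return f"Answer to '{question}' using [{caption}]"

def caption_assisted_reasoning(question: str, image_description: str) -> str:
    # Bridge the modalities: image -> text, then text -> answer.
    caption = generate_caption(image_description)
    return reason_over_text(question, caption)

print(caption_assisted_reasoning(
    "What net force acts on the block?",
    "block resting on an inclined plane with friction arrows"))
```

The design point the abstract emphasizes is that the reasoning model never sees pixels: all visual content is routed through the caption, so any advance in text-only reasoning transfers directly to the multimodal setting.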