

Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

September 7, 2025
Authors: Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong
cs.AI

Abstract

Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.
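The abstract describes a two-stage idea: first convert the visual input into a textual caption, then let a text-only reasoning model answer over the caption plus the question. A minimal sketch of that pipeline is below; every function name, prompt, and return value here is a hypothetical placeholder standing in for real model calls, not the authors' released implementation.

```python
# Sketch of a caption-assisted reasoning pipeline (illustrative only).
# In a real system, both stages would be backed by model APIs; here they
# are deterministic stubs so the control flow is runnable as-is.

def generate_caption(image_description: str) -> str:
    # Stage 1 (hypothetical): a vision-language model would produce a
    # detailed textual caption of the figure here.
    return f"Caption: {image_description}"

def reason_over_text(question: str, caption: str) -> str:
    # Stage 2 (hypothetical): a strong text-only reasoning model would
    # answer the question using the caption as its view of the image.
    return f"Answer to '{question}' using [{caption}]"

def caption_assisted_reasoning(question: str, image_description: str) -> str:
    # Bridge the modalities: image -> text, then text -> answer.
    caption = generate_caption(image_description)
    return reason_over_text(question, caption)

print(caption_assisted_reasoning(
    "What net force acts on the block?",
    "block resting on an inclined plane with friction arrows"))
```

The design point the abstract emphasizes is that the reasoning model never sees pixels: all visual content is routed through the caption, so any advance in text-only reasoning transfers directly to the multimodal setting.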