Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

Abstract

Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.
