MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

Abstract

This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and large language models (LLMs) through a large benchmark, so that researchers can more easily follow the state of the art in this fast-moving area. A growing number of testbeds have already boosted the evolution of general-purpose large language models; this year's MARS2 therefore focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of multimodal large language models (MLLMs). Our organizing team released two tailored datasets, Lens and AdsQA, as test sets; they support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines, including both generalist MLLMs and task-specific models, and opened three competition tracks: Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). In total, 76 teams from renowned academic and industrial institutions registered, and 40+ valid submissions (out of 1200+ total submissions) were included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and on our GitHub organization page, https://github.com/mars2workshop/, where we will continue to post updates and announcements of upcoming events.