MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
September 17, 2025
Authors: Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang, Junming Liang, Guoyou Li, Zhaoxiang Wang, Qiang Zhou, Yichen Zhao, Shili Xiong, Hyeongjin Nam, Jaerin Lee, Jaeyoung Chung, JoonKyu Park, Junghun Oh, Kanggeon Lee, Wooseok Lee, Juneyoung Ro, Turghun Osman, Can Hu, Chaoyang Liao, Cheng Chen, Chengcheng Han, Chenhao Qiu, Chong Peng, Cong Xu, Dailin Li, Feiyu Wang, Feng Gao, Guibo Zhu, Guopeng Tang, Haibo Lu, Han Fang, Han Qi, Hanxiao Wu, Haobo Cheng, Hongbo Sun, Hongyao Chen, Huayong Hu, Hui Li, Jiaheng Ma, Jiang Yu, Jianing Wang, Jie Yang, Jing He, Jinglin Zhou, Jingxuan Li, Josef Kittler, Lihao Zheng, Linnan Zhao, Mengxi Jia, Muyang Yan, Nguyen Thanh Thien, Pu Luo, Qi Li, Shien Song, Shijie Dong, Shuai Shao, Shutao Li, Taofeng Xue, Tianyang Xu, Tianyi Gao, Tingting Li, Wei Zhang, Weiyang Su, Xiaodong Dong, Xiao-Jun Wu, Xiaopeng Zhou, Xin Chen, Xin Wei, Xinyi You, Xudong Kang, Xujie Zhou, Xusheng Liu, Yanan Wang, Yanbin Huang, Yang Liu, Yang Yang, Yanglin Deng, Yashu Kang, Ye Yuan, Yi Wen, Yicen Tian, Yilin Tao, Yin Tang, Yipeng Lin, Yiqing Wang, Yiting Xi, Yongkang Yu, Yumei Li, Yuxin Qin, Yuying Chen, Yuzhe Cen, Zhaofan Zou, Zhaohong Liu, Zhehao Shen, Zhenglin Du, Zhengyang Li, Zhenni Huang, Zhenwei Shao, Zhilong Song, Zhiyong Feng, Zhiyu Wang, Zhou Yu, Ziang Li, Zihan Zhai, Zijian Zhang, Ziyang Peng, Ziyun Xiao, Zongshu Li
cs.AI
Abstract
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. Through a large benchmark, we aim to bring together different approaches in multimodal machine learning and large language models (LLMs), and to help researchers keep pace with the state of the art in this fast-moving area. Meanwhile, a growing number of testbeds have accelerated the evolution of general-purpose large language models. This year's MARS2 therefore focuses on real-world and specialized scenarios to broaden the range of reasoning applications for multimodal large language models (MLLMs). Our organizing team released two tailored datasets, Lens and AdsQA, as test sets; they support general reasoning across 12 everyday scenarios and domain-specific reasoning over advertisement videos, respectively. We evaluated more than 40 baselines, covering both generalist MLLMs and task-specific models, and opened three competition tracks: Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). In total, 76 teams from renowned academic and industrial institutions registered, and more than 40 valid submissions (out of 1,200+ in total) were included in our ranking lists. Our datasets, code (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page at https://github.com/mars2workshop/, where updates and announcements of upcoming events will be posted.