

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

July 1, 2025
Authors: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Leyi Pan, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianle Gong, Wenkai Li, Wei Jia, Xin Lyu, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuxuan Zhang, Zhanxiao Du, Zhenyu Hou, Zhao Xue, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
cs.AI

Abstract

We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding, among others. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.
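
The abstract names Reinforcement Learning with Curriculum Sampling (RLCS) as the post-training stage but does not describe its mechanics; the full report should be consulted for the actual algorithm. As a rough illustration only, the sketch below shows one generic way curriculum sampling can be wired into an RL training loop: per-prompt difficulty is tracked as a rolling pass rate, and sampling favors prompts of intermediate difficulty. The class name `CurriculumSampler`, the pass-rate heuristic, and all parameters are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of curriculum sampling for RL post-training.
# Not the paper's implementation: difficulty is estimated from a rolling
# pass rate per prompt, and sampling weight peaks for prompts the policy
# solves about half the time (neither trivial nor hopeless).
import random
from collections import defaultdict


class CurriculumSampler:
    def __init__(self, prompts, ema=0.9):
        self.prompts = list(prompts)
        self.ema = ema  # smoothing factor for pass-rate updates
        self.pass_rate = defaultdict(lambda: 0.5)  # optimistic prior

    def _weight(self, prompt):
        # Weight p * (1 - p) peaks at pass rate 0.5 and vanishes at 0 or 1.
        p = self.pass_rate[prompt]
        return max(1e-3, p * (1.0 - p))

    def sample_batch(self, batch_size):
        weights = [self._weight(p) for p in self.prompts]
        return random.choices(self.prompts, weights=weights, k=batch_size)

    def update(self, prompt, reward):
        # reward in [0, 1], e.g. verifier success on a STEM or grounding task.
        old = self.pass_rate[prompt]
        self.pass_rate[prompt] = self.ema * old + (1.0 - self.ema) * reward


# Usage: interleave sampling, rollouts, and pass-rate updates in the RL loop.
sampler = CurriculumSampler(["q1", "q2", "q3"])
for step in range(3):
    batch = sampler.sample_batch(batch_size=2)
    for prompt in batch:
        reward = random.random()  # placeholder for policy rollout + verifier
        sampler.update(prompt, reward)
```

The intermediate-difficulty weighting above is a common generic heuristic for keeping the learning signal informative; whether GLM-4.1V-Thinking uses this or a different scheduling criterion is specified only in the full report.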