GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
April 29, 2026
Authors: V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji, Jinjiang Wang, Jing Chen, Jiazheng Xu, Jiale Zhu, Jiale Cheng, Ji Qi, Guobing Gan, Guo Wang, Cong Yao, Zijun Dou, Zihao Zhou, Zihan Wang, Zhiqi Ge, Zhijie Li, Zhenyu Hou, Zhao Xue, Zehui Wang, Zehai He, Yusen Liu, Yukuo Cen, Yuchen Li, Yuan Wang, Yijian Lu, Yanzi Wang, Yadong Xue, Xinyu Zhang, Xinyu Liu, Wenkai Li, Tianyu Tong, Tianshu Zhang, Shengdong Yan, Qinkai Zheng, Mingde Xu, Licheng Bao, Jiaxing Xu, Jiaxin Fan, Jiawen Qian, Jiali Chen, Jiahui Lin, Haozhi Zheng, Haoran Wang, Haochen Li, Fan Yang, Dan Zhang, Chuangxin Zhao, Chengcheng Wu, Boyan Shi, Bowei Jia, Baoxu Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
cs.AI
Abstract
We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning but also on the ability to perceive, interpret, and act across heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.
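To make the "perception inside the agent loop" framing concrete, the following is a minimal, purely illustrative Python sketch of a perceive-plan-act loop in which raw screenshots remain part of the model's context at every step, rather than being flattened to captions by a separate vision module. It is not the GLM-5V-Turbo API; all names here (Step, Action, Model, run_tool) are assumptions introduced for illustration.

```python
# Hypothetical sketch: a multimodal agent loop where visual observations
# enter the model context directly. None of these names come from the report.
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Step:
    role: str                      # "user", "model", or "tool"
    text: str
    image: Optional[bytes] = None  # raw pixels, kept in context verbatim


@dataclass
class Action:
    tool: Optional[str]  # None means the model produced a final answer
    args: str
    text: str


class Model(Protocol):
    def generate(self, history: list[Step]) -> Action: ...


def run_tool(name: str, args: str) -> tuple[str, Optional[bytes]]:
    """Placeholder environment call: executes e.g. a click or a shell
    command and returns a textual observation plus an optional fresh
    screenshot. Left unimplemented in this sketch."""
    raise NotImplementedError


def agent_loop(model: Model, task: str, screenshot: bytes,
               max_steps: int = 10) -> str:
    # The initial screenshot is part of the model input from the start,
    # so perception and planning happen in the same forward pass.
    history = [Step("user", task, image=screenshot)]
    for _ in range(max_steps):
        action = model.generate(history)
        if action.tool is None:
            return action.text  # task complete
        obs_text, obs_image = run_tool(action.tool, action.args)
        # Each new observation, including any new screenshot, re-enters
        # the context directly, keeping visual state verifiable end to end.
        history.append(Step("tool", obs_text, image=obs_image))
    return "step budget exhausted"
```

The design point this sketch is meant to show is the contrast with a caption-then-reason pipeline: here there is no intermediate text-only bottleneck between the environment's visual state and the model's reasoning.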