GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
April 29, 2026
Authors: V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji, Jinjiang Wang, Jing Chen, Jiazheng Xu, Jiale Zhu, Jiale Cheng, Ji Qi, Guobing Gan, Guo Wang, Cong Yao, Zijun Dou, Zihao Zhou, Zihan Wang, Zhiqi Ge, Zhijie Li, Zhenyu Hou, Zhao Xue, Zehui Wang, Zehai He, Yusen Liu, Yukuo Cen, Yuchen Li, Yuan Wang, Yijian Lu, Yanzi Wang, Yadong Xue, Xinyu Zhang, Xinyu Liu, Wenkai Li, Tianyu Tong, Tianshu Zhang, Shengdong Yan, Qinkai Zheng, Mingde Xu, Licheng Bao, Jiaxing Xu, Jiaxin Fan, Jiawen Qian, Jiali Chen, Jiahui Lin, Haozhi Zheng, Haoran Wang, Haochen Li, Fan Yang, Dan Zhang, Chuangxin Zhao, Chengcheng Wu, Boyan Shi, Bowei Jia, Baoxu Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
cs.AI
Abstract
We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning but also on the ability to perceive, interpret, and act across heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than serving as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments yield strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.