GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Abstract

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

