GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Abstract

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

