GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Abstract

We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings from developing the reasoning-centric training framework. We first develop a capable vision foundation model through large-scale pre-training, which arguably sets the upper bound for final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability gains across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models, and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.
