Play to Generalize: Learning to Reason Through Game Play

Abstract

Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning (ViGaL), in which MLLMs develop out-of-domain generalization of multimodal reasoning by playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games such as Snake significantly enhances its downstream performance on multimodal math benchmarks like MathVista and on multi-discipline questions like MMMU. The model sees no worked solutions, equations, or diagrams during RL, which suggests that it captures transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data on multimodal reasoning benchmarks while preserving the base model's performance on general visual benchmarks, an area where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks that unlock generalizable multimodal reasoning abilities in MLLMs.
