Play to Generalize: Learning to Reason Through Game Play
June 9, 2025
Authors: Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei
cs.AI
Abstract
Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning (ViGaL), in which MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g., Snake, significantly enhances its downstream performance on multimodal math benchmarks such as MathVista and on multi-discipline questions such as MMMU, without the model seeing any worked solutions, equations, or diagrams during RL, suggesting that it acquires transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data on multimodal reasoning benchmarks, while preserving the base model's performance on general visual benchmarks, a balance that specialist models often fail to strike. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks that unlock generalizable multimodal reasoning abilities in MLLMs.
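To make the pretext-task idea concrete, below is a minimal sketch of the kind of rule-based game environment that could supply verifiable rewards for RL post-training. The class name, interface, and reward values are illustrative assumptions for this sketch, not the paper's actual setup, and a text rendering stands in for the game frames an MLLM would actually observe.

```python
# Hypothetical sketch of a rule-based Snake environment usable as an RL
# pretext task. All names and the reward scheme are illustrative; the
# paper's actual environment and reward design may differ.
import random
from collections import deque

class SnakeEnv:
    """Minimal Snake on a square grid with a verifiable scalar reward."""

    def __init__(self, size=8, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        mid = self.size // 2
        self.snake = deque([(mid, mid)])  # head is snake[0]
        self.food = self._place_food()
        self.done = False
        return self._observe()

    def _place_food(self):
        free = [(r, c) for r in range(self.size) for c in range(self.size)
                if (r, c) not in self.snake]
        return self.rng.choice(free)

    def _observe(self):
        # Render the board as text; an MLLM would instead receive an image.
        grid = [["." for _ in range(self.size)] for _ in range(self.size)]
        for r, c in self.snake:
            grid[r][c] = "S"
        fr, fc = self.food
        grid[fr][fc] = "F"
        return "\n".join("".join(row) for row in grid)

    def step(self, action):
        """action in {'up','down','left','right'}; returns (obs, reward, done)."""
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        hr, hc = self.snake[0]
        head = (hr + dr, hc + dc)
        # Rule-based termination: hitting a wall or the snake's own body
        # ends the episode with a negative, automatically checkable reward.
        off_board = not (0 <= head[0] < self.size and 0 <= head[1] < self.size)
        if off_board or head in self.snake:
            self.done = True
            return self._observe(), -1.0, True
        self.snake.appendleft(head)
        if head == self.food:          # ate the food: grow and reward
            self.food = self._place_food()
            return self._observe(), 1.0, False
        self.snake.pop()               # otherwise move without growing
        return self._observe(), 0.0, False
```

Because the game rules fully determine the reward, every trajectory can be scored automatically and generated at arbitrary scale without human annotation, which is what makes such games controllable and scalable pretext tasks in the sense of the abstract.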