VisualSphinx: Large-Scale Synthetic Visual Logic Puzzles for Reinforcement Learning

Abstract

Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical for tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale, well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic training dataset for visual logical reasoning. To tackle the challenge of synthesizing images with grounded answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates image-synthesis code with grounded answers for puzzle sample assembly. Experiments demonstrate that VLMs trained with GRPO on VisualSphinx benefit from the logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks, such as algebraic, arithmetic, and geometric reasoning.