
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL

May 29, 2025
作者: Yichen Feng, Zhangchen Xu, Fengqing Jiang, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
cs.AI

Abstract

Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical reasoning training dataset. To tackle the challenge of synthesizing images with grounded answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates code for grounded synthetic images used in puzzle sample assembly. Experiments demonstrate that VLMs trained with GRPO on VisualSphinx benefit from the logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also transfer to other reasoning tasks such as algebraic, arithmetic, and geometric reasoning.
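The abstract describes the rule-to-image pipeline only at a high level. The sketch below is a minimal, hypothetical illustration of that flow (rule extraction from seed questions, rule expansion, rendering-code generation, and puzzle sample assembly). All names and data structures here (`PuzzleRule`, `extract_rules`, `generate_render_code`, etc.) and the placeholder logic are assumptions made for illustration, not the authors' implementation, which uses LLM-driven rule extraction and code generation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema; the paper does not publish this exact data structure.
@dataclass
class PuzzleRule:
    description: str          # natural-language statement of the visual pattern

@dataclass
class PuzzleSample:
    panel_code: List[str]     # rendering scripts for the question panels
    options_code: List[str]   # rendering scripts for the candidate answers
    answer_index: int         # index of the correct option (grounded by the rule)

def extract_rules(seed_questions: List[str]) -> List[PuzzleRule]:
    """Stage 1 (sketch): derive abstract pattern rules from seed puzzles.
    In the paper this step is model-driven; here each seed question is
    simply wrapped as a placeholder rule."""
    return [PuzzleRule(description=q) for q in seed_questions]

def expand_rules(rules: List[PuzzleRule], variants: int = 2) -> List[PuzzleRule]:
    """Stage 2 (sketch): grow the rule pool by producing variations of each rule."""
    expanded = []
    for rule in rules:
        for i in range(variants):
            expanded.append(PuzzleRule(description=f"{rule.description} (variant {i})"))
    return expanded

def generate_render_code(rule: PuzzleRule) -> PuzzleSample:
    """Stage 3 (sketch): emit code that draws panels consistent with the rule,
    plus distractor options, so the correct answer is grounded in the rule."""
    panels = [f"draw_panel(step={t}, rule={rule.description!r})" for t in range(3)]
    options = [f"draw_option(k={k}, rule={rule.description!r})" for k in range(4)]
    return PuzzleSample(panel_code=panels, options_code=options, answer_index=0)

def build_dataset(seed_questions: List[str]) -> List[PuzzleSample]:
    """Stage 4 (sketch): assemble puzzle samples from the expanded rule pool."""
    rules = expand_rules(extract_rules(seed_questions))
    return [generate_render_code(r) for r in rules]

if __name__ == "__main__":
    samples = build_dataset(["shapes rotate 90 degrees clockwise each step"])
    print(f"assembled {len(samples)} puzzle samples")
```

Because each sample is generated from an explicit rule, the correct answer is known by construction, which is what makes the resulting puzzles usable as verifiable rewards for GRPO-style RL training.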