VisualSphinx: 強化学習のための大規模合成視覚論理パズル

要旨

視覚言語モデル（VLM）は、効果的なマルチモーダル推論を行い、論理的に一貫した意思決定を下すことが期待されており、図表理解や空間問題解決などのタスクにおいて重要です。しかし、現在のVLMの推論能力は、大規模で構造化されたトレーニングデータセットの不足に悩まされています。このギャップを埋めるため、我々はVisualSphinxを提案します。これは、初の大規模合成視覚論理推論トレーニングデータです。画像合成と接地された回答を伴う課題に対処するため、ルールから画像を合成するパイプラインを提案します。このパイプラインは、シード質問からパズルのルールを抽出・拡張し、パズルサンプルのアセンブリのための接地合成画像合成のコードを生成します。実験により、VisualSphinxを使用してGRPOでトレーニングされたVLMは、データセットの論理的一貫性と可読性の恩恵を受け、論理推論タスクにおいて性能が向上することが示されました。VisualSphinxから発展した強化された推論能力は、代数推論、算術推論、幾何学推論などの他の推論タスクにも役立ちます。

English

Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical reasoning training data. To tackle the challenge of image synthesis with grounding answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates the code of grounding synthesis image synthesis for puzzle sample assembly. Experiments demonstrate that VLM trained using GRPO on VisualSphinx benefit from logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning and geometry reasoning.

VisualSphinx: 強化学習のための大規模合成視覚論理パズル

VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL

要旨

Support