VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL

Abstract

Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical for tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning is held back by the lack of large-scale, well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic training dataset for visual logical reasoning. To tackle the challenge of synthesizing images whose answers are grounded, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates image synthesis code that grounds those rules, for use in puzzle sample assembly. Experiments demonstrate that VLMs trained with GRPO on VisualSphinx benefit from the logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks, such as algebraic, arithmetic, and geometric reasoning.
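As a rough illustration of the GRPO training signal mentioned above, the sketch below computes group-relative advantages for several sampled answers to a single puzzle. The abstract does not specify the reward design, so the multiple-choice format, the binary `answer_reward` function, and the use of the population standard deviation are all assumptions made for this example, not details of the authors' setup.

```python
from statistics import mean, pstdev


def answer_reward(predicted: str, correct: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the chosen option matches, else 0.0."""
    return 1.0 if predicted.strip().upper() == correct.strip().upper() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group.

    This is the group-relative baseline used by GRPO: responses sampled for
    the same prompt are scored against each other rather than against a
    learned value function. `eps` avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one puzzle whose correct option is assumed to be "C".
sampled_answers = ["C", "A", "C", "D"]
rewards = [answer_reward(a, "C") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, correct responses receive a positive advantage and incorrect ones a negative advantage, while a group in which every response earns the same reward contributes no learning signal.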
