Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Abstract

Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose treating Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable models to acquire and apply Euclidean principles from these geometry problems, we used Group Relative Policy Optimization (GRPO) to fine-tune the Qwen2.5VL and RoboBrain2.0 families, encouraging the models to identify shapes, count and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer broadly transferable spatial skills on vision-language models. Code and the Euclid30K dataset are available at https://zgca-ai4edu.github.io/Euclids_Gift.
