Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
September 29, 2025
Authors: Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen
cs.AI
Abstract
Spatial intelligence spans a rich suite of abilities, including visualising
and transforming shapes, mentally rotating objects, judging relative
positions and containment, and estimating numerosity. However, it remains
a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To
fill this gap, we propose to treat Euclidean geometry problem-solving as a
surrogate task. Specifically, we curated a multimodal
dataset, called Euclid30K, comprising approximately 30K plane and solid
geometry problems. To enable the models to acquire and apply Euclidean
principles from these problems, we employed Group Relative Policy
Optimization (GRPO) to fine-tune the Qwen2.5VL and RoboBrain2.0 model families,
encouraging the models to identify shapes, count and relate entities, and
perform multi-step deductive reasoning with Euclidean principles. Our
experiments demonstrate that the resulting models achieve substantial zero-shot
gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench,
VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after
training on Euclid30K, the mean VSI-Bench accuracy of all evaluated models
rose from 34.5% to 40.5%, an improvement of 6.0 percentage points. Among them,
RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous
state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first
systematic study showing that geometry-centric fine-tuning can endow
vision-language models with broadly transferable spatial skills. Code and
the Euclid30K dataset are available at https://zgca-ai4edu.github.io/Euclids_Gift.
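For readers unfamiliar with GRPO, the following is the standard group-relative objective from the GRPO literature, written at the sequence level for brevity; the paper's exact reward design, group size G, and hyperparameters (epsilon, beta) are not given in this abstract and may differ.

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\]
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\right]
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
\]

where q is a geometry problem (question plus figure), o_1, ..., o_G are G responses sampled from the old policy, and r_i is the scalar reward for response o_i (e.g., answer correctness). Normalizing rewards within the group yields the advantage directly, so no learned value model is required.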