Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
September 29, 2025
Authors: Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen
cs.AI
Abstract
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem solving as a surrogate task. Specifically, we construct a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable models to acquire and apply Euclidean principles from these geometry problems, we employ Group Relative Policy Optimization (GRPO) to fine-tune the Qwen2.5VL and RoboBrain2.0 families, encouraging the models to identify shapes, count, relate entities, and perform multi-step deductive reasoning with Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptation. Notably, after training on Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can endow vision-language models with broadly transferable spatial skills. Code and the Euclid30K dataset are available at https://zgca-ai4edu.github.io/Euclids_Gift.
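The abstract describes GRPO fine-tuning on geometry problems but gives no implementation details. As a minimal sketch, assuming a rule-based reward that compares a sampled response's final number against the reference answer of a geometry problem, the snippet below illustrates how GRPO forms group-relative advantages from a batch of sampled responses; the function names, the answer-extraction regex, and the tolerance are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): a hypothetical verifiable reward
# for geometry answers plus the group-relative advantage normalization used in GRPO.
import re
from statistics import mean, pstdev


def geometry_reward(response: str, ground_truth: float, tol: float = 1e-2) -> float:
    """Hypothetical rule-based reward: 1.0 if the last number in the response
    matches the reference answer within a tolerance, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - ground_truth) <= tol else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: each sampled response's reward is normalized by the
    mean and standard deviation of its own sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to "area of a circle with radius 3" (reference ≈ 28.2743).
responses = ["The area is 28.27", "Area = 28.3", "I think it is 18.85", "28.274333882"]
rewards = [geometry_reward(r, 28.2743) for r in responses]
print(group_relative_advantages(rewards))  # responses near the reference get positive advantage
```

Normalizing each reward against its own sampling group removes the need for a learned value model, which is the core simplification GRPO makes relative to PPO.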