Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
November 1, 2025
Authors: Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
cs.AI
Abstract
While Vision-Language Models (VLMs) post-trained with Reinforcement Learning
(RL) show impressive general reasoning, their evaluation is often confined to
language-dominant tasks (e.g., math). This raises a critical question: can RL
post-training truly extend the inherent capability boundary of a base VLM,
particularly for visual-centric spatial tasks where it initially fails? To
investigate this, we introduce Ariadne, a framework utilizing synthetic mazes
for multi-step spatial reasoning where task difficulty (e.g., path length,
turns) is precisely controlled. We leverage this controllable environment to
train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a
difficulty-aware curriculum. Surprisingly, after RLVR post-training, the VLM achieves
over 50% accuracy on a problem set where the base model scored 0%,
demonstrating that our approach expands the model's initial capability
boundary. To assess real-world viability, we evaluate out-of-distribution (OOD)
generalization on practical benchmarks. Despite training only on synthetic maze
samples, Ariadne achieves significant zero-shot improvements, averaging 16% on
MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer
tasks). These results confirm that our method not only broadens the model's
fundamental limits but also enhances its generalization to real-world spatial
reasoning. We acknowledge our study is limited to the post-training phase,
given the opaqueness of pre-training data, and hope our research motivates
further work on specialized, capability-extending alignment.
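
To make the controllable setup concrete, the following is a minimal Python sketch of the two ingredients the abstract describes: a task generator whose difficulty knobs are path length and turn count, and a binary verified reward that programmatically checks a predicted move string. It assumes a simplified wall-free grid world, and every name in it (generate_task, verified_reward, the move encoding) is illustrative rather than the authors' released code.

    import random

    # Moves on a grid: up, down, left, right.
    MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

    def sample_path(size, rng):
        # Self-avoiding random walk on a size x size grid; returns the
        # visited cells and the move string that produced them.
        r, c = rng.randrange(size), rng.randrange(size)
        cells, moves = [(r, c)], []
        while True:
            options = [(m, (r + dr, c + dc)) for m, (dr, dc) in MOVES.items()
                       if 0 <= r + dr < size and 0 <= c + dc < size
                       and (r + dr, c + dc) not in cells]
            if not options:
                return cells, "".join(moves)
            m, (r, c) = rng.choice(options)
            cells.append((r, c))
            moves.append(m)

    def num_turns(moves):
        # A "turn" is any change of direction between consecutive moves.
        return sum(a != b for a, b in zip(moves, moves[1:]))

    def generate_task(size, path_len, turns, seed=0):
        # Rejection-sample a start/goal pair whose reference path has
        # exactly `path_len` steps and `turns` direction changes -- the
        # two difficulty knobs the framework controls.
        rng = random.Random(seed)
        while True:
            cells, moves = sample_path(size, rng)
            if len(moves) >= path_len and num_turns(moves[:path_len]) == turns:
                return cells[0], cells[path_len], moves[:path_len]

    def verified_reward(start, goal, size, predicted_moves):
        # Binary verified reward: 1.0 iff the predicted move string walks
        # from start to goal without stepping off the grid.
        r, c = start
        for m in predicted_moves:
            if m not in MOVES:
                return 0.0
            dr, dc = MOVES[m]
            r, c = r + dr, c + dc
            if not (0 <= r < size and 0 <= c < size):
                return 0.0
        return 1.0 if (r, c) == goal else 0.0

    # Difficulty-aware curriculum (assumed form): schedule tasks from
    # easy to hard by their controllable knobs.
    curriculum = sorted((L, T) for L in range(2, 9) for T in range(L))

    # Sanity check: the reference path must earn the full reward.
    start, goal, ref = generate_task(size=6, path_len=5, turns=2, seed=42)
    assert verified_reward(start, goal, 6, ref) == 1.0

In this sketch the reward is verifiable because correctness is decided by simulation rather than by a learned judge, which is what makes the RLVR training signal exact; the sorted (path_len, turns) list stands in for the paper's difficulty-aware curriculum, scheduling easier tasks before harder ones.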