Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
November 1, 2025
Authors: Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
cs.AI
Abstract
While Vision-Language Models (VLMs) post-trained with Reinforcement Learning
(RL) show impressive general reasoning, their evaluation is often confined to
language-dominant tasks (e.g., math). This raises a critical question: can RL
post-training truly extend the inherent capability boundary of a base VLM,
particularly for visual-centric spatial tasks where it initially fails? To
investigate this, we introduce Ariadne, a framework utilizing synthetic mazes
for multi-step spatial reasoning where task difficulty (e.g., path length,
turns) is precisely controlled. We leverage this controllable environment to
train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a
difficulty-aware curriculum. Surprisingly, after RLVR post-training, the VLM
achieves over 50% accuracy on a problem set where the base model scored 0%,
demonstrating that our approach expands the model's initial capability
boundary. To assess real-world viability, we evaluate out-of-distribution (OOD)
generalization on practical benchmarks. Despite training only on synthetic maze
samples, Ariadne achieves significant zero-shot improvements, averaging 16% on
MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer
tasks). These results confirm that our method not only broadens the model's
fundamental limits but also enhances its generalization to real-world spatial
reasoning. We acknowledge our study is limited to the post-training phase,
given the opaqueness of pre-training data, and hope our research motivates
further work on specialized, capability-extending alignment.
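
To make the training setup concrete, here is a minimal sketch of the two mechanisms the abstract describes: a binary verifiable reward that replays a predicted move sequence against the maze, and a difficulty measure (path length, turn count) used to order tasks into an easy-to-hard curriculum. The move encoding and all names (MazeTask, verified_reward, curriculum_stages) are assumptions for illustration, not the paper's released implementation.

```python
# Minimal illustrative sketch, not the authors' released code. Assumes answers
# are move strings over {U, D, L, R} on a grid maze; MazeTask, verified_reward,
# and curriculum_stages are hypothetical names introduced here for illustration.
from dataclasses import dataclass

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

@dataclass
class MazeTask:
    grid: list[list[int]]   # 0 = free cell, 1 = wall
    start: tuple[int, int]
    goal: tuple[int, int]
    solution: str           # ground-truth move string, e.g. "RRDDR"

def difficulty(task: MazeTask) -> tuple[int, int]:
    """Difficulty knobs named in the abstract: path length and turn count."""
    turns = sum(a != b for a, b in zip(task.solution, task.solution[1:]))
    return (len(task.solution), turns)

def verified_reward(task: MazeTask, answer: str) -> float:
    """Binary verifiable reward: 1.0 iff the moves legally reach the goal."""
    r, c = task.start
    for m in answer:
        if m not in MOVES:
            return 0.0              # malformed answer
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(task.grid) and 0 <= c < len(task.grid[0])):
            return 0.0              # stepped off the grid
        if task.grid[r][c] == 1:
            return 0.0              # walked into a wall
    return 1.0 if (r, c) == task.goal else 0.0

def curriculum_stages(tasks: list[MazeTask], n_stages: int = 3) -> list[list[MazeTask]]:
    """Difficulty-aware curriculum: sort tasks easy-to-hard, split into stages."""
    ordered = sorted(tasks, key=difficulty)
    size = max(1, len(ordered) // n_stages)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

# Usage: a 2x2 maze where only "RD" reaches the goal.
task = MazeTask(grid=[[0, 0], [1, 0]], start=(0, 0), goal=(1, 1), solution="RD")
assert verified_reward(task, "RD") == 1.0
assert verified_reward(task, "DR") == 0.0  # blocked by the wall at (1, 0)
```

Because the reward only checks the final trajectory against the maze, it needs no learned judge, which is what makes the controlled difficulty sweep (longer paths, more turns) straightforward to verify at scale.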