
Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

November 1, 2025
Authors: Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
cs.AI

Abstract

While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model's initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model's fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.