Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
November 1, 2025
Authors: Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
cs.AI
Abstract
While Vision-Language Models (VLMs) post-trained with Reinforcement Learning
(RL) show impressive general reasoning, their evaluation is often confined to
language-dominant tasks (e.g., math). This raises a critical question: can RL
post-training truly extend the inherent capability boundary of a base VLM,
particularly for visual-centric spatial tasks where it initially fails? To
investigate this, we introduce Ariadne, a framework utilizing synthetic mazes
for multi-step spatial reasoning where task difficulty (e.g., path length,
turns) is precisely controlled. We leverage this controllable environment to
train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a
difficulty-aware curriculum. Surprisingly, after RLVR post-training, the VLM
achieves over 50% accuracy on a problem set where the base model scored 0%,
demonstrating that our approach expands the model's initial capability
boundary. To assess real-world viability, we evaluate out-of-distribution (OOD)
generalization on practical benchmarks. Despite training only on synthetic maze
samples, Ariadne achieves significant zero-shot improvements, averaging 16% on
MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer
tasks). These results confirm that our method not only broadens the model's
fundamental limits but also enhances its generalization to real-world spatial
reasoning. We acknowledge our study is limited to the post-training phase,
given the opaqueness of pre-training data, and hope our research motivates
further work on specialized, capability-extending alignment.
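
To make the training setup concrete, here is a minimal sketch of the two mechanisms the abstract describes: a binary verifiable reward that replays a predicted move sequence against the maze, and a difficulty measure (path length, turn count) used to order tasks into an easy-to-hard curriculum. The move encoding and all names (MazeTask, verified_reward, curriculum_stages) are assumptions for illustration, not the paper's released implementation.

```python
# Minimal illustrative sketch, not the authors' released code. Assumes answers
# are move strings over {U, D, L, R} on a grid maze; MazeTask, verified_reward,
# and curriculum_stages are hypothetical names introduced here for illustration.
from dataclasses import dataclass

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

@dataclass
class MazeTask:
    grid: list[list[int]]   # 0 = free cell, 1 = wall
    start: tuple[int, int]
    goal: tuple[int, int]
    solution: str           # ground-truth move string, e.g. "RRDDR"

def difficulty(task: MazeTask) -> tuple[int, int]:
    """Difficulty knobs named in the abstract: path length and turn count."""
    turns = sum(a != b for a, b in zip(task.solution, task.solution[1:]))
    return (len(task.solution), turns)

def verified_reward(task: MazeTask, answer: str) -> float:
    """Binary verifiable reward: 1.0 iff the moves legally reach the goal."""
    r, c = task.start
    for m in answer:
        if m not in MOVES:
            return 0.0              # malformed answer
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(task.grid) and 0 <= c < len(task.grid[0])):
            return 0.0              # stepped off the grid
        if task.grid[r][c] == 1:
            return 0.0              # walked into a wall
    return 1.0 if (r, c) == task.goal else 0.0

def curriculum_stages(tasks: list[MazeTask], n_stages: int = 3) -> list[list[MazeTask]]:
    """Difficulty-aware curriculum: sort tasks easy-to-hard, split into stages."""
    ordered = sorted(tasks, key=difficulty)
    size = max(1, len(ordered) // n_stages)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

# Usage: a 2x2 maze where only "RD" reaches the goal.
task = MazeTask(grid=[[0, 0], [1, 0]], start=(0, 0), goal=(1, 1), solution="RD")
assert verified_reward(task, "RD") == 1.0
assert verified_reward(task, "DR") == 0.0  # blocked by the wall at (1, 0)
```

Because the reward only checks the final trajectory against the maze, it needs no learned judge, which is what makes the controlled difficulty sweep (longer paths, more turns) straightforward to verify at scale.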