IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance
January 22, 2026
Authors: Jongwoo Park, Kanchana Ranasinghe, Jinhyeok Jang, Cristina Mata, Yoo Sung Jang, Michael S Ryoo
cs.AI
Abstract
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: jongwoopark7978.github.io/IVRA
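The sketch below illustrates the general idea described in the abstract: reusing patch-level affinities from the model's own vision encoder to re-mix visual-token states inside one frozen language-model layer at inference time. It is a minimal illustration under assumptions, not the released IVRA implementation; the function names (`patch_affinity`, `inject_affinity`), the blending weight `alpha`, and the choice of cosine similarity are all hypothetical.

```python
# Minimal sketch (PyTorch) of affinity-guided visual-token re-mixing.
# Assumptions: affinities come from cosine similarity of the built-in vision
# encoder's patch features, and they are blended into the hidden states of a
# chosen LM layer with a fixed weight alpha. No parameters are updated.

import torch
import torch.nn.functional as F


def patch_affinity(patch_feats: torch.Tensor) -> torch.Tensor:
    """Row-normalized affinity between vision-encoder patch features.

    patch_feats: (num_patches, dim) features from the built-in vision encoder.
    Returns a (num_patches, num_patches) matrix of mixing weights.
    """
    f = F.normalize(patch_feats, dim=-1)
    aff = f @ f.t()                       # pairwise cosine similarities
    return torch.softmax(aff, dim=-1)     # normalize each row into weights


@torch.no_grad()
def inject_affinity(hidden: torch.Tensor,
                    affinity: torch.Tensor,
                    visual_idx: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Blend affinity-weighted visual-token states into a frozen LM layer.

    hidden:     (seq_len, dim) hidden states at the selected layer.
    affinity:   (num_patches, num_patches) output of patch_affinity().
    visual_idx: indices of the visual tokens within the flattened sequence.
    alpha:      blending weight for the affinity-mixed states (assumed).
    """
    vis = hidden[visual_idx]              # (num_patches, dim) visual tokens
    vis_mixed = affinity @ vis            # spatially aware re-mixing
    out = hidden.clone()
    out[visual_idx] = (1 - alpha) * vis + alpha * vis_mixed
    return out
```

In this reading, the intervention is purely a forward-pass modification: the affinity matrix carries the 2D spatial structure that flattening into a 1D token sequence discards, and blending it at a layer where instance-level features reside realigns visual-token interactions without retraining.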