IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance
January 22, 2026
Authors: Jongwoo Park, Kanchana Ranasinghe, Jinhyeok Jang, Cristina Mata, Yoo Sung Jang, Michael S Ryoo
cs.AI
Abstract
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: jongwoopark7978.github.io/IVRA
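The core mechanism described above (blending affinity hints from the frozen vision encoder into the visual-token interactions of one language-model layer at inference time) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released IVRA implementation: it assumes the affinity hints are softmax-normalized patch-feature similarities from the built-in vision encoder and that they are convexly blended into the visual-to-visual block of a layer's attention map. The names `patch_affinity`, `inject_affinity`, the blend weight `alpha`, and the toy shapes are hypothetical stand-ins.

```python
# Minimal sketch of an IVRA-style, training-free affinity injection.
# Assumptions (illustrative, not the paper's released code): "affinity hints"
# are cosine similarities between the frozen vision encoder's patch features,
# blended into the visual-to-visual block of one language-model layer's
# attention map at inference time, with all model parameters kept fixed.

import torch
import torch.nn.functional as F


def patch_affinity(patch_feats: torch.Tensor, temp: float = 0.1) -> torch.Tensor:
    """Affinity hints from the vision encoder: softmax-normalized cosine
    similarities between patch features. Shape: (N_vis, N_vis)."""
    f = F.normalize(patch_feats, dim=-1)          # (N_vis, D)
    return torch.softmax(f @ f.t() / temp, dim=-1)


def inject_affinity(attn: torch.Tensor,
                    affinity: torch.Tensor,
                    vis_idx: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Blend encoder affinities into the visual-to-visual block of an
    attention map (rows/columns given by vis_idx), then re-normalize rows
    so each row still sums to 1. No parameters are updated."""
    out = attn.clone()
    block = out[..., vis_idx[:, None], vis_idx[None, :]]   # (N_vis, N_vis)
    out[..., vis_idx[:, None], vis_idx[None, :]] = (
        (1.0 - alpha) * block + alpha * affinity            # convex blend
    )
    return out / out.sum(dim=-1, keepdim=True)


if __name__ == "__main__":
    torch.manual_seed(0)
    n_tokens, n_vis, d = 16, 9, 32                 # toy sizes for illustration
    vis_idx = torch.arange(n_vis)                  # assume visual tokens come first
    patch_feats = torch.randn(n_vis, d)            # frozen ViT patch features
    attn = torch.softmax(torch.randn(n_tokens, n_tokens), dim=-1)
    new_attn = inject_affinity(attn, patch_affinity(patch_feats), vis_idx)
    print(new_attn.shape, float(new_attn.sum(dim=-1).mean()))  # rows sum to ~1
```

In a real model this blend would be applied inside a selected language-model layer (e.g., via a forward hook on its attention module) rather than on a standalone matrix; the selective choice of layer and blend weight is a design decision of the method, and the values used here are placeholders.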