VLS: Steering Pretrained Robot Policies via Vision-Language Models

February 3, 2026
Authors: Shuo Liu, Ishneet Sukhvinder Singh, Yiqing Xu, Jiafei Duan, Ranjay Krishna
cs.AI

Abstract

Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation-language inputs without modifying policy parameters. By leveraging vision-language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements. Across simulation and real-world evaluations, VLS consistently outperforms prior steering methods, achieving a 31% improvement on CALVIN and a 13% gain on LIBERO-PRO. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts. Project page: https://vision-language-steering.github.io/webpage/
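The steering mechanism described in the abstract can be pictured as reward-gradient guidance of the denoising loop: at each sampling step, the frozen policy proposes a trajectory and the gradient of a VLM-synthesized, trajectory-differentiable reward nudges that proposal toward the test-time requirement. The sketch below illustrates this idea only; the interface names (`policy.denoise`, `policy.horizon`, `reward_fn`, `guidance_scale`) are assumptions for illustration, not the paper's actual API.

```python
# Minimal, hypothetical sketch of inference-time steering of a frozen diffusion
# policy, in the spirit of VLS. Interface names are assumptions, not the paper's API.
import torch


def steered_sampling(policy, obs, reward_fn, steps=50, guidance_scale=1.0):
    """Sample an action trajectory while nudging each denoising step toward
    higher VLM-synthesized reward; the policy weights stay frozen throughout."""
    actions = torch.randn(1, policy.horizon, policy.action_dim)  # start from noise
    for t in reversed(range(steps)):
        with torch.no_grad():
            # Frozen policy proposes the next (less noisy) trajectory.
            pred = policy.denoise(actions, t, obs)

        # Differentiate the reward w.r.t. the trajectory only, never the policy.
        traj = pred.detach().requires_grad_(True)
        reward = reward_fn(traj, obs)  # e.g. obstacle clearance, goal proximity
        grad = torch.autograd.grad(reward, traj)[0]

        # Steer the denoised proposal along the reward gradient.
        actions = (pred + guidance_scale * grad).detach()
    return actions
```

Because gradients flow only into the sampled trajectory and not into the policy parameters, the adaptation stays training-free, matching the abstract's claim that behaviors are redirected at inference time rather than relearned.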