Retrieval, Not Retraining: Extending Vision-Language-Action Models to New Tasks at Test Time

Abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, which makes adaptation costly in both data collection and compute. In this paper, we show that this target-side, per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (the query side) and a cheaper embodiment (the pool side, e.g., human-hand video), and is then frozen. At deployment, new tasks are added by appending pool-side demonstrations to the retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than by updating parameters. Fine-tuning is needed only to support a new, unseen embodiment, not for each new task. We show that the benefit of retrieval is not tied to a specific backbone and extends to standard VLA policies, but the effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual-consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles. On RoboTwin 2.0, our method outperforms cross-embodiment baselines on unseen tasks. We additionally demonstrate the method on a real robot.
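The deployment loop described in the abstract (append pool-side demonstrations, retrieve at every control step, keep the policy frozen) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the `RetrievalPool` class, the per-step embeddings, cosine-similarity nearest-neighbor lookup, and the `frozen_policy` stand-in, which simply returns the first retrieved waypoint in place of a trained network.

```python
import numpy as np


class RetrievalPool:
    """Hypothetical store of pool-side demonstrations, indexed per time step."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim))  # one embedding per stored time step
        self.index = []                 # (demo_id, step) for each key row
        self.demos = {}                 # demo_id -> trajectory array

    def add_demo(self, demo_id, embeddings, trajectory):
        """Add a new task by indexing its demonstration; no parameters change."""
        embeddings = np.asarray(embeddings, dtype=float)
        trajectory = np.asarray(trajectory, dtype=float)
        if len(embeddings) != len(trajectory):
            raise ValueError("need one embedding per trajectory step")
        self.demos[demo_id] = trajectory
        self.keys = np.vstack([self.keys, embeddings])
        self.index.extend((demo_id, t) for t in range(len(trajectory)))

    def retrieve(self, query, horizon):
        """Return the best-matching demo id and the trajectory segment after the match."""
        if not self.index:
            raise ValueError("retrieval pool is empty")
        query = np.asarray(query, dtype=float)
        q = query / (np.linalg.norm(query) + 1e-8)
        k = self.keys / (np.linalg.norm(self.keys, axis=1, keepdims=True) + 1e-8)
        best = int(np.argmax(k @ q))    # cosine similarity (assumed metric)
        demo_id, t = self.index[best]
        return demo_id, self.demos[demo_id][t:t + horizon]


def frozen_policy(obs_embedding, retrieved_segment):
    """Stand-in for the frozen network: follow the first retrieved waypoint."""
    return retrieved_segment[0]


def control_step(pool, obs_embedding, horizon=4):
    """One control step: retrieve, then condition the frozen policy on the result."""
    demo_id, segment = pool.retrieve(obs_embedding, horizon)
    return demo_id, frozen_policy(obs_embedding, segment)


# Usage: a new task becomes available purely by appending to the pool.
pool = RetrievalPool(dim=2)
pool.add_demo("push", [[1.0, 0.0], [0.9, 0.1]], [[0.0, 0.0], [1.0, 0.0]])
pool.add_demo("stack", [[0.0, 1.0]], [[5.0, 5.0]])
demo_id, action = control_step(pool, np.array([0.0, 1.0]))
```

The point of the sketch is the division of labor: `add_demo` is the only operation needed when a task is added, while `frozen_policy` is never modified.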