Retrieval, Not Retraining: Extending Vision-Language-Action Models to New Tasks at Test Time

Abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, which makes adaptation costly in both data collection and compute. We show that this target-side, per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (the query) and a cheaper embodiment (the pool, e.g., human-hand video), and is then frozen. At deployment, a new task is added by appending pool-side demonstrations to the retrieval pool. Because the frozen policy conditions on retrieved trajectories at every control step, new tasks are absorbed by indexing data rather than by updating parameters. Fine-tuning is needed only for a new, unseen embodiment, not for each new task. We show that retrieval improves policies across backbones, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual-consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles. On RoboTwin 2.0, our method outperforms cross-embodiment baselines on unseen tasks. We additionally demonstrate the method on a real robot.
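The deployment loop described above (append demonstrations to a pool, retrieve at every control step, keep the policy frozen) can be illustrated with a minimal sketch. This is not the paper's implementation: `RetrievalPool`, `rollout`, the `embed` and `policy` callables, and the use of cosine similarity over fixed-length embeddings are all assumptions introduced here for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


@dataclass
class RetrievalPool:
    """Hypothetical index of pool-side demonstrations (e.g. human-hand video)."""
    keys: list = field(default_factory=list)          # one embedding per demo
    trajectories: list = field(default_factory=list)  # demo stored with each key

    def add(self, key, trajectory):
        # Adding a task only appends data; no policy parameter is touched.
        self.keys.append(list(key))
        self.trajectories.append(trajectory)

    def retrieve(self, query, k=1):
        # Rank stored demos by similarity to the query embedding (assumed metric).
        order = sorted(range(len(self.keys)),
                       key=lambda i: cosine(query, self.keys[i]),
                       reverse=True)
        return [self.trajectories[i] for i in order[:k]]


def rollout(policy, embed, pool, observations, k=1):
    """Run a frozen policy, re-retrieving from the pool at every control step."""
    actions = []
    for obs in observations:
        retrieved = pool.retrieve(embed(obs), k=k)
        actions.append(policy(obs, retrieved))  # policy weights stay fixed
    return actions
```

In this sketch, supporting a new task amounts to one more `pool.add(...)` call before `rollout`; the `policy` callable is never modified, which mirrors the claim that new tasks are absorbed by indexing data rather than by updating parameters.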