Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time
June 14, 2026
Authors: Jeongeun Park, Juhan Park, Taekyung Kim, Sungjoon Choi, Dongyoon Han, Sangdoo Yun
cs.AI
Abstract
Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to support a new, unseen embodiment, not for each new task. We show that the benefit of retrieval is not tied to a specific backbone and extends to standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles. On RoboTwin 2.0, our method outperforms cross-embodiment baselines on unseen tasks. We additionally demonstrate the method on a real robot.
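The deployment loop described in the abstract (a frozen policy, a retrieval pool that grows by appending demonstrations, and retrieval at every control step) can be illustrated with a toy sketch. This is not the authors' implementation: `PoolDemo`, `RetrievalPool`, `FrozenPolicy`, `rollout`, the cosine-similarity lookup, the fixed query embedding, and the scalar `gain` standing in for trained weights are all hypothetical simplifications chosen for illustration.

```python
import math
from dataclasses import dataclass, field

# All names here are hypothetical; none are taken from the paper.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class PoolDemo:
    task: str
    key: list         # assumed embedding of a pool-side demo (e.g. human-hand video)
    trajectory: list  # coarse waypoints, one per control step

@dataclass
class RetrievalPool:
    demos: list = field(default_factory=list)

    def append(self, demo):
        # Adding a task is an indexing operation; no policy parameter changes.
        self.demos.append(demo)

    def retrieve(self, query):
        # Return the demo whose key is most similar to the query embedding.
        return max(self.demos, key=lambda d: cosine(query, d.key))

@dataclass(frozen=True)
class FrozenPolicy:
    gain: float = 0.5  # stand-in for trained weights, immutable after training

    def act(self, state, retrieved, step):
        # Condition on the retrieved trajectory: step toward its current waypoint.
        idx = min(step, len(retrieved.trajectory) - 1)
        target = retrieved.trajectory[idx]
        return [self.gain * (t - s) for s, t in zip(state, target)]

def rollout(policy, pool, query, state, steps):
    demo = None
    for step in range(steps):
        demo = pool.retrieve(query)  # retrieval happens at every control step
        action = policy.act(state, demo, step)
        state = [s + a for s, a in zip(state, action)]
    return state, demo.task

policy = FrozenPolicy()
pool = RetrievalPool()
pool.append(PoolDemo("push_left", [1.0, 0.0], [[-1.0, 0.0]] * 3))

# A new task arrives at deployment: append a demo, leave the policy untouched.
pool.append(PoolDemo("push_up", [0.0, 1.0], [[0.0, 1.0]] * 3))
final_state, used_task = rollout(policy, pool, [0.0, 1.0], [0.0, 0.0], 3)
print(used_task, final_state)
```

In this sketch the second task becomes available purely by calling `pool.append`; the `FrozenPolicy` instance is declared immutable to mirror the claim that parameters are not updated per task. A real system would compute the query from the target embodiment's current observation rather than holding it fixed.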