Retrieval Without Retraining: Extending Vision-Language-Action Models to New Tasks at Test Time

Abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than by updating parameters. Fine-tuning is needed only to support a new, unseen embodiment, not for each new task. We show that the benefit of retrieval is not tied to a specific backbone: it improves standard VLA policies as well, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles. On RoboTwin 2.0, our method outperforms cross-embodiment baselines on unseen tasks. We additionally demonstrate the method on a real robot.
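
The deployment-time mechanism described in the abstract (append pool-side demonstrations, then condition a frozen policy on whatever is retrieved at each control step) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `PoolDemo`, `RetrievalPool`, and `frozen_policy`, the use of cosine similarity over a fixed embedding, and the toy policy that simply steers toward the retrieved waypoint. The actual retrieval features and policy architecture are not specified here.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class PoolDemo:
    """One pool-side demonstration (hypothetical container)."""
    task: str
    embedding: List[float]         # assumed feature vector used for lookup
    trajectory: List[List[float]]  # assumed sequence of coarse waypoints


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class RetrievalPool:
    """Index of pool-side demos; adding a task is just an append."""
    demos: List[PoolDemo] = field(default_factory=list)

    def add(self, demo: PoolDemo) -> None:
        self.demos.append(demo)

    def retrieve(self, query_embedding: List[float]) -> PoolDemo:
        if not self.demos:
            raise ValueError("retrieval pool is empty")
        # Nearest neighbour by cosine similarity (an assumed choice).
        return max(self.demos, key=lambda d: cosine(d.embedding, query_embedding))


def frozen_policy(observation: List[float], retrieved: PoolDemo, step: int) -> List[float]:
    """Toy stand-in for a frozen policy: no parameters exist or change here.

    It returns the offset from the current observation to the retrieved
    waypoint for this control step, clamping at the last waypoint.
    """
    idx = min(step, len(retrieved.trajectory) - 1)
    target = retrieved.trajectory[idx]
    return [t - o for t, o in zip(target, observation)]


# Usage: the pool initially holds one task; a second is added at deployment.
pool = RetrievalPool()
pool.add(PoolDemo("push", [1.0, 0.0], [[0.0, 0.0], [1.0, 0.0]]))

query = [0.1, 0.9]  # assumed embedding of the current target-side context
before = pool.retrieve(query).task

# "Adding a new task" is an index update only; frozen_policy is untouched.
pool.add(PoolDemo("stack", [0.0, 1.0], [[0.0, 0.0], [0.0, 1.0]]))
after = pool.retrieve(query).task

action = frozen_policy([0.0, 0.0], pool.retrieve(query), step=1)
```

In this sketch the same `frozen_policy` function produces different behavior before and after the second `add` call, purely because the retrieved demonstration changes. That is the property the abstract attributes to the method, shown here only in toy form.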