Retrieval Instead of Retraining: Extending Vision-Language-Action Models to New Tasks at Test Time

Abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, which makes adaptation costly in both data collection and compute. In this paper, we show that this target-side, per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (the query side) and a cheaper embodiment (the pool side, e.g., human-hand video), and is then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. Because the frozen policy conditions on retrieved trajectories at every control step, new tasks are absorbed by indexing data rather than by updating parameters. Fine-tuning is needed only when introducing a new, unseen embodiment, not for each new task. Retrieval improves policies regardless of the specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual-consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles. On RoboTwin 2.0, our method outperforms cross-embodiment baselines on unseen tasks. We additionally demonstrate the method on a real robot.
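
The deployment loop described above (a frozen policy, a retrieval pool that grows by appending demonstrations, and a lookup at every control step) can be illustrated with a minimal sketch. This is not the paper's implementation. The names `RetrievalPool`, `cosine`, `frozen_policy`, and `control_loop` are hypothetical, and the toy 2-D embeddings, the cosine-similarity metric, and the stand-in policy are all assumptions made only for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


@dataclass
class RetrievalPool:
    """Hypothetical index of pool-side demonstrations keyed by an embedding."""
    keys: list = field(default_factory=list)
    trajectories: list = field(default_factory=list)

    def add(self, key, trajectory):
        # Adding a task only appends data; no policy parameters change.
        self.keys.append(list(key))
        self.trajectories.append(trajectory)

    def retrieve(self, query):
        # Return the trajectory whose key is most similar to the query.
        if not self.keys:
            raise ValueError("retrieval pool is empty")
        best = max(range(len(self.keys)),
                   key=lambda i: cosine(query, self.keys[i]))
        return self.trajectories[best]


def frozen_policy(observation, retrieved):
    # Stand-in for a frozen network: it has no trainable state here and
    # simply reports which retrieved demonstration it is conditioned on.
    return {"conditioned_on": retrieved["task"]}


def control_loop(pool, observations):
    actions = []
    for obs in observations:
        retrieved = pool.retrieve(obs)  # retrieval happens at every step
        actions.append(frozen_policy(obs, retrieved))
    return actions


pool = RetrievalPool()
pool.add([1.0, 0.0], {"task": "push"})
before = control_loop(pool, [[0.1, 0.9]])

# A new task is added at deployment by indexing one more demonstration.
pool.add([0.0, 1.0], {"task": "stack"})
after = control_loop(pool, [[0.1, 0.9]])
```

In this toy setup the same observation is routed to a different demonstration once the new entry is indexed, while `frozen_policy` itself is untouched. A real system would replace the 2-D keys with learned embeddings and the stand-in policy with the trained model.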