PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies
September 22, 2025
Authors: Jesse Zhang, Marius Memmel, Kevin Kim, Dieter Fox, Jesse Thomason, Fabio Ramos, Erdem Bıyık, Abhishek Gupta, Anqi Li
cs.AI
Abstract
Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which fine-tunes VLMs to predict a unified point-based intermediate representation: 1. end-effector paths specifying what actions to take, and 2. task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4x real-world improvement for a 3D policy trained only in simulation, and 2-3.5x gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need: where, what, and how. Website at https://peek-robot.github.io/.
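To make the overlay idea concrete, the sketch below draws a predicted end-effector path and a set of task-relevant points directly onto an RGB observation before it is handed to a policy. This is a minimal illustration under assumed conventions: the function name `overlay_peek_annotations`, the point formats, and the colors are hypothetical and are not the paper's actual API.

```python
import numpy as np
import cv2  # OpenCV, used here only to rasterize the overlays


def overlay_peek_annotations(image, path_points, mask_points):
    """Overlay a point-based intermediate representation onto an observation.

    image:       H x W x 3 uint8 RGB robot observation.
    path_points: (N, 2) pixel coordinates of the predicted end-effector path,
                 ordered from the current gripper position toward the goal.
    mask_points: (M, 2) pixel coordinates marking task-relevant regions.

    Shapes, argument names, and colors are illustrative assumptions.
    """
    annotated = image.copy()

    # Draw the end-effector path as a polyline ("what to do").
    pts = np.asarray(path_points, dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(annotated, [pts], isClosed=False, color=(255, 0, 0), thickness=2)

    # Mark task-relevant points ("where to attend") as filled circles.
    for x, y in np.asarray(mask_points, dtype=np.int32):
        cv2.circle(annotated, (int(x), int(y)), 4, (0, 255, 0), -1)

    return annotated


# Because the cues live in image space, the annotated frame can simply replace
# the raw observation for any downstream policy (2D, 3D, or VLA), e.g.:
#   action = policy(overlay_peek_annotations(obs, path, mask), proprio_state)
```

The appeal of drawing the cues into the image itself, rather than passing them as an extra input head, is that no policy architecture changes are required; this is what makes the representation policy-agnostic in the sense described above.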