PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies

Abstract

Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which fine-tunes VLMs to predict a unified point-based intermediate representation: (1) end-effector paths specifying what actions to take, and (2) task-relevant masks indicating where to focus. These annotations are overlaid directly onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline that generates labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4x real-world improvement for a 3D policy trained only in simulation, and 2-3.5x gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need: where, what, and how. Website at https://peek-robot.github.io/.
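To make the overlay idea concrete, the following is a minimal sketch of how a predicted path and a point-based mask might be drawn onto an RGB observation before it is passed to a policy. It assumes the VLM has already returned both sets of points as pixel coordinates. The function name `overlay_peek_annotations`, the disk-shaped mask built around each mask point, the dimming factor, and the path color are all hypothetical choices made for illustration; they are not taken from the PEEK implementation.

```python
import numpy as np


def overlay_peek_annotations(image, path_points, mask_points,
                             mask_radius=12, dim_factor=0.2,
                             path_color=(255, 0, 0)):
    """Overlay a 2D path and a point-based mask onto an RGB image.

    image:       (H, W, 3) uint8 array.
    path_points: sequence of (x, y) pixel coordinates, in order.
    mask_points: sequence of (x, y) pixel coordinates marking relevant regions.
    All other parameters are illustrative assumptions.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Assumed mask construction: keep a disk of pixels around each mask point.
    keep = np.zeros((h, w), dtype=bool)
    for px, py in mask_points:
        keep |= (xs - px) ** 2 + (ys - py) ** 2 <= mask_radius ** 2

    # Dim everything outside the mask so the policy sees mostly relevant pixels.
    out = image.astype(np.float32)
    out[~keep] *= dim_factor
    out = out.astype(np.uint8)

    # Draw the path as densely sampled straight segments between waypoints.
    pts = np.asarray(path_points, dtype=np.float32)
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for t in np.linspace(0.0, 1.0, n + 1):
            x = int(round(float(x0 + t * (x1 - x0))))
            y = int(round(float(y0 + t * (y1 - y0))))
            if 0 <= x < w and 0 <= y < h:
                out[y, x] = path_color
    return out


# Usage: a uniform gray frame with a short horizontal path and one mask point.
frame = np.full((64, 64, 3), 200, dtype=np.uint8)
annotated = overlay_peek_annotations(
    frame, path_points=[(10, 10), (30, 10)], mask_points=[(10, 10)],
    mask_radius=5)
```

Because the annotations live in image space, the same annotated frame could in principle be fed to any image-conditioned policy without changing its architecture, which is the sense in which the abstract calls the representation policy-agnostic.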
