InSight: Self-Guided Skill Acquisition via Steerable VLAs

Abstract

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills present in the training data. We present InSight, a framework that enables autonomous skill acquisition by making VLAs steerable at the level of primitive actions (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that uses vision-language model (VLM) plan decomposition and end-effector poses to partition demonstrations into labeled primitives, which makes the VLA steerable at the primitive level, and (2) a VLM-guided data flywheel that identifies the missing primitives required to accomplish a novel task, autonomously attempts demonstrations of those primitives using VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight on simulated and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings show that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: https://insight-vla.github.io.
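The control flow of the data flywheel in stage (2) can be illustrated with a minimal sketch. This is not the authors' implementation: `PrimitiveLibrary`, `flywheel_step`, the `attempt` and `is_success` callables, and the `max_tries` parameter are all hypothetical names introduced here, standing in for the VLM-proposed low-level controller and the automatic success check that the abstract mentions but does not specify. Trajectories are represented as plain lists purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Trajectory = List[float]  # placeholder type; real demonstrations would hold poses and images


@dataclass
class PrimitiveLibrary:
    """Hypothetical store mapping a primitive label to its collected demonstrations."""
    demos: Dict[str, List[Trajectory]] = field(default_factory=dict)

    def missing(self, plan: List[str]) -> List[str]:
        # Primitives the plan needs that have no demonstrations yet, in plan order.
        seen = set()
        result = []
        for label in plan:
            if label not in self.demos and label not in seen:
                seen.add(label)
                result.append(label)
        return result

    def add(self, label: str, trajectory: Trajectory) -> None:
        self.demos.setdefault(label, []).append(trajectory)


def flywheel_step(
    plan: List[str],
    library: PrimitiveLibrary,
    attempt: Callable[[str], Trajectory],            # assumed: runs one autonomous attempt
    is_success: Callable[[str, Trajectory], bool],   # assumed: automatic success check
    max_tries: int = 3,                              # assumed retry budget
) -> List[str]:
    """Attempt each missing primitive and keep only successful demonstrations."""
    learned = []
    for label in library.missing(plan):
        for _ in range(max_tries):
            trajectory = attempt(label)
            if is_success(label, trajectory):
                library.add(label, trajectory)  # labeled, stored, ready for retraining
                learned.append(label)
                break
    return learned


# Usage with stub callables in place of a real robot and VLM.
library = PrimitiveLibrary({
    "move gripper to the bowl": [[0.0, 0.1]],
    "lift upward": [[0.1, 0.3]],
})
plan = ["move gripper to the bowl", "lift upward", "pour the bottle"]
learned = flywheel_step(
    plan,
    library,
    attempt=lambda label: [0.0, 1.0],
    is_success=lambda label, trajectory: True,
)
```

In this sketch the plan is assumed to arrive already decomposed into primitive labels. Producing that decomposition, and retraining the VLA on the enlarged library, are outside its scope.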