GRAPE: Generalizing Robot Policy via Preference Alignment

Abstract

Despite recent advances of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, because they rely on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, which introduces distribution bias and limits their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs at the trajectory level and implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks into independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively. All code, models, and data are available at https://grape-vla.github.io/
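The abstract does not state the loss used for trajectory-level alignment. As a rough illustration only, the sketch below shows one common way to model reward implicitly from a preferred and a dispreferred trajectory: a DPO-style objective applied to summed per-step action log-probabilities. The function names, the reference-policy inputs, and the `beta` value are all assumptions made for this example and are not taken from the paper.

```python
import math


def trajectory_log_prob(step_log_probs):
    """Log-probability of a whole trajectory: the sum of per-step action log-probs."""
    return sum(step_log_probs)


def trajectory_preference_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """DPO-style loss for one (preferred, dispreferred) trajectory pair.

    Each argument is a list of per-step action log-probabilities, under the
    policy being trained or under a frozen reference policy. `beta` is a
    hypothetical temperature chosen for illustration.
    """
    # Implicit reward of a trajectory: scaled log-ratio of policy to reference.
    reward_win = beta * (trajectory_log_prob(policy_win) - trajectory_log_prob(ref_win))
    reward_lose = beta * (trajectory_log_prob(policy_lose) - trajectory_log_prob(ref_lose))
    margin = reward_win - reward_lose
    # -log(sigmoid(margin)); for very negative margins use the linear limit
    # to avoid overflow in exp().
    if margin < -30.0:
        return -margin
    return math.log1p(math.exp(-margin))
```

In this sketch the loss equals log 2 when the policy matches the reference, and it falls as the policy assigns relatively more probability to the preferred trajectory than to the dispreferred one. Under the abstract's description, the preference ranking itself would come from stage-wise spatiotemporal constraints rather than from a fixed success label; that scoring step is not shown here.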