AffordanceVLA: A Vision-Language-Action Model Advancing Action Generation via Affordance-Aware Understanding

Abstract

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformers (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.