AffordanceVLA: A Vision-Language-Action Model That Enables Action Generation Through Affordance-Aware Understanding

Abstract

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between the semantic spaces of VLMs and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we model manipulation priors progressively through three complementary components: 1) Which2Act, for object-centric grounding via visual latent prediction to suppress distractors; 2) Where2Act, for 2D interaction localization via affordance map estimation; and 3) How2Act, for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data curation pipeline. Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
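
To make the step from a 2D affordance map to a 3D cue more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how Where2Act or How2Act compute their outputs; here we simply assume that a 2D affordance map is a grid of scores, that the interaction pixel is its peak, and that a depth value and pinhole camera intrinsics are available to lift that pixel into the camera frame. The function names `peak_pixel` and `back_project`, the intrinsics `fx`, `fy`, `cx`, `cy`, and all numeric values are hypothetical.

```python
from typing import List, Tuple


def peak_pixel(heatmap: List[List[float]]) -> Tuple[int, int]:
    """Return (u, v) = (column, row) of the highest-scoring pixel.

    Assumption: the affordance map is a dense 2D grid of scores.
    """
    best = (0, 0)
    best_score = float("-inf")
    for v, row in enumerate(heatmap):
        for u, score in enumerate(row):
            if score > best_score:
                best_score = score
                best = (u, v)
    return best


def back_project(
    u: int, v: int, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with known depth to a 3D point in the camera frame,
    using the standard pinhole camera model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Toy 3x3 affordance map and depth map (hypothetical values).
heatmap = [
    [0.0, 0.1, 0.0],
    [0.2, 0.1, 0.9],
    [0.0, 0.3, 0.0],
]
depth_map = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 0.5],
    [1.0, 1.0, 1.0],
]

u, v = peak_pixel(heatmap)  # 2D interaction location
point = back_project(u, v, depth_map[v][u], fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print((u, v), point)
```

In a learned system the peak selection would typically be replaced by a differentiable operation and the 3D cue would be predicted rather than computed from depth; the sketch only shows the geometric relationship between a 2D interaction point and a 3D target.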