AffordanceVLA: A Vision-Language-Action Model That Empowers Action Generation Through Affordance-Aware Understanding

Abstract

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation, yielding a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act, which performs object-centric semantic grounding via visual latent prediction to suppress environmental distractions; 2) Where2Act, which performs 2D interaction localization via affordance map estimation; and 3) How2Act, which guides manipulation policies via 3D geometric reasoning. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model with a three-stage strategy that follows a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
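To make the progression from Which2Act to Where2Act to How2Act concrete, the following is a minimal toy sketch of how an object-level cue, a 2D affordance map, and a 3D point could be chained. It is not the AffordanceVLA implementation: the abstract does not specify how the components exchange information, so the function names (`mask_affordance`, `peak_pixel`, `back_project`), the binary object mask, the argmax selection, and the pinhole back-projection are all assumptions chosen only for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]


def mask_affordance(affordance: Grid, object_mask: List[List[int]]) -> Grid:
    """Zero out scores outside the target object.

    Hypothetical stand-in for object-centric grounding (Which2Act).
    """
    return [
        [score if keep else 0.0 for score, keep in zip(score_row, mask_row)]
        for score_row, mask_row in zip(affordance, object_mask)
    ]


def peak_pixel(affordance: Grid) -> Tuple[int, int]:
    """Return (u, v) = (column, row) of the highest score.

    Hypothetical stand-in for 2D interaction localization (Where2Act).
    """
    best, best_score = (0, 0), float("-inf")
    for v, row in enumerate(affordance):
        for u, score in enumerate(row):
            if score > best_score:
                best, best_score = (u, v), score
    return best


def back_project(u: int, v: int, depth: float,
                 fx: float, fy: float, cx: float, cy: float
                 ) -> Tuple[float, float, float]:
    """Lift a pixel to a camera-frame 3D point with the pinhole model.

    Hypothetical stand-in for a 3D geometric cue (How2Act).
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


# Toy 3x3 inputs (made-up values): the top-right pixel is a high-scoring distractor.
affordance = [[0.1, 0.2, 0.9],
              [0.0, 0.7, 0.3],
              [0.0, 0.4, 0.1]]
object_mask = [[0, 0, 0],
               [0, 1, 1],
               [0, 1, 0]]
depth_map = [[1.0, 1.0, 1.0],
             [1.0, 0.5, 0.6],
             [1.0, 0.5, 1.0]]

masked = mask_affordance(affordance, object_mask)
u, v = peak_pixel(masked)
point_3d = back_project(u, v, depth_map[v][u], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

In this toy setup the unmasked map peaks on the distractor, while the masked map peaks on the target object. That is the intuition behind grounding the object before localizing the interaction, although the actual model predicts these cues with learned modules rather than fixed rules.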