
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

June 4, 2026
Authors: Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen
cs.AI

Abstract

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act, which performs object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act, which performs 2D interaction localization via affordance map estimation; and 3) How2Act, which performs 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model with a three-stage strategy and a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and in real-world settings demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
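
To make the idea of a 2D affordance map feeding 3D reasoning more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a dense 2D affordance map given as a grid of scores, an aligned depth map in meters, and a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`. It picks the highest-scoring pixel as the interaction location and back-projects it to a 3D point in the camera frame. All function names here (`peak_pixel`, `back_project`, `interaction_point_3d`) are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

Grid = List[List[float]]  # row-major 2D grid of floats


def peak_pixel(affordance_map: Grid) -> Tuple[int, int]:
    """Return (row, col) of the highest-scoring cell in a 2D affordance map."""
    best = (0, 0)
    best_score = float("-inf")
    for r, row in enumerate(affordance_map):
        for c, score in enumerate(row):
            if score > best_score:
                best_score = score
                best = (r, c)
    return best


def back_project(row: int, col: int, depth_m: float,
                 fx: float, fy: float, cx: float, cy: float
                 ) -> Tuple[float, float, float]:
    """Lift a pixel with known depth to a 3D camera-frame point (pinhole model)."""
    x = (col - cx) * depth_m / fx
    y = (row - cy) * depth_m / fy
    return (x, y, depth_m)


def interaction_point_3d(affordance_map: Grid, depth_map: Grid,
                         fx: float, fy: float, cx: float, cy: float
                         ) -> Tuple[float, float, float]:
    """Hypothetical helper: 2D affordance peak -> 3D interaction point."""
    r, c = peak_pixel(affordance_map)
    return back_project(r, c, depth_map[r][c], fx, fy, cx, cy)


# Toy 3x3 example with assumed (made-up) values.
affordance = [
    [0.1, 0.2, 0.9],
    [0.1, 0.4, 0.3],
    [0.0, 0.2, 0.1],
]
depth = [[0.5, 0.5, 0.5] for _ in range(3)]
point = interaction_point_3d(affordance, depth, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
```

A learned model would presumably predict the affordance map and reason over richer geometry than a single back-projected point; the sketch only shows how a spatially grounded 2D cue can anchor a 3D target.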