GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning
February 4, 2026
Authors: Guoqing Ma, Siheng Wang, Zeyu Zhang, Shan Yu, Hao Tang
cs.AI
Abstract
Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that existing models exhibit limited zero-shot capability, which hampers their ability to generalize effectively to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that utilizes the generalization ability of foundation models more effectively, enabling zero-shot manipulation and automatically generating training data for robotics. In particular, we study a class of hierarchical VLA models in which a high-level Affordance Segmentation Module (ASM) is finetuned to perceive image keypoint affordances of the scene; a mid-level 3DAgent carries out task understanding, skill-knowledge retrieval, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory; and the intermediate 3D path prediction then serves as guidance for a low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstrations, making it far more scalable to diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. Behavior cloning policies trained on the generated demonstrations are more robust than those trained on human demonstrations or on data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can serve as a scalable method both for generating robot training data and for solving novel tasks in a zero-shot setting. Code: https://github.com/AIGeeksGroup/GeneralVLA. Website: https://aigeeksgroup.github.io/GeneralVLA.
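To make the three-level data flow described in the abstract concrete, the following is a minimal Python sketch of the pipeline, assuming hypothetical class and method names (AffordanceSegmentationModule, ThreeDAgent, ControlPolicy, run_generalvla) that are not part of the released codebase; it only illustrates how keypoint affordances feed 3D path planning, which in turn guides low-level control. See the linked repository for the actual implementation.

```python
# Illustrative sketch of the hierarchical GeneralVLA pipeline described in the
# abstract; all names below are hypothetical placeholders, not the authors' API.
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class AffordanceResult:
    keypoints_2d: List[Tuple[int, int]]  # image keypoints predicted as actionable
    scores: List[float]                  # per-keypoint affordance confidence


class AffordanceSegmentationModule:
    """High level: finetuned module that perceives image keypoint affordances."""

    def predict(self, rgb_image) -> AffordanceResult:
        raise NotImplementedError  # placeholder for the finetuned ASM


class ThreeDAgent:
    """Mid level: task understanding, skill-knowledge retrieval, trajectory planning."""

    def plan(self, instruction: str, affordances: AffordanceResult,
             depth_image) -> List[Point3D]:
        # Returns a 3D path describing the desired end-effector trajectory.
        raise NotImplementedError


class ControlPolicy:
    """Low level: 3D-aware policy that follows the planned path with precise control."""

    def act(self, path_3d: List[Point3D], observation) -> List[dict]:
        raise NotImplementedError  # emits low-level robot commands


def run_generalvla(instruction: str, rgb_image, depth_image, observation):
    """Chain the three levels: affordances -> 3D path -> control commands."""
    asm, agent, policy = AffordanceSegmentationModule(), ThreeDAgent(), ControlPolicy()
    affordances = asm.predict(rgb_image)                         # high-level perception
    path_3d = agent.plan(instruction, affordances, depth_image)  # mid-level 3D path
    return policy.act(path_3d, observation)                      # low-level execution
```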