
GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning

February 4, 2026
Authors: Guoqing Ma, Siheng Wang, Zeyu Zhang, Shan Yu, Hao Tang
cs.AI

Abstract

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that existing models exhibit limited zero-shot capability, which hampers their ability to generalize effectively to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that makes more effective use of the generalization of foundation models, enabling zero-shot manipulation and automatic data generation for robotics. In particular, we study a class of hierarchical VLA models in which a high-level Affordance Segmentation Module (ASM) is finetuned to perceive image keypoint affordances of the scene; a mid-level 3DAgent carries out task understanding, skill knowledge, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory; and this intermediate 3D path prediction then serves as guidance for a low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstrations, making it far more scalable across diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. Behavior cloning policies trained on the generated demonstrations are more robust than those trained on human demonstrations or on data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can serve as a scalable method both for generating robotics data and for solving novel tasks in a zero-shot setting. Code: https://github.com/AIGeeksGroup/GeneralVLA. Website: https://aigeeksgroup.github.io/GeneralVLA.
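The abstract describes a three-stage hierarchy: a high-level ASM that detects keypoint affordances in the image, a mid-level 3DAgent that turns the task and affordances into a 3D end-effector path, and a low-level 3D-aware policy that tracks that path. The sketch below is a minimal, hypothetical illustration of that data flow only; it is not the GeneralVLA implementation, and every class, function, and parameter name here is an assumption made for illustration.

```python
"""Hypothetical sketch of the three-stage hierarchy described in the abstract.

All names are placeholders: the paper specifies an ASM, a 3DAgent, and a
low-level 3D-aware policy, but not their interfaces.
"""
from dataclasses import dataclass
import numpy as np


@dataclass
class AffordanceKeypoints:
    """Image keypoints with affordance labels produced by the high-level stage."""
    pixels: np.ndarray   # (N, 2) keypoint locations in image coordinates
    labels: list         # affordance label per keypoint, e.g. "graspable"


def affordance_segmentation(rgb: np.ndarray) -> AffordanceKeypoints:
    """High-level stage (placeholder): perceive keypoint affordances in the image."""
    # A real ASM would be a finetuned vision model; here we return one dummy keypoint.
    h, w, _ = rgb.shape
    return AffordanceKeypoints(pixels=np.array([[w // 2, h // 2]]), labels=["graspable"])


def plan_3d_trajectory(keypoints: AffordanceKeypoints,
                       depth: np.ndarray,
                       instruction: str,
                       n_waypoints: int = 10) -> np.ndarray:
    """Mid-level stage (placeholder): map affordances + instruction to a 3D path."""
    # A real 3DAgent would combine task understanding and skill knowledge;
    # here we lift the first keypoint to 3D and interpolate a straight-line approach.
    u, v = keypoints.pixels[0]
    target = np.array([u * 0.001, v * 0.001, float(depth[v, u])])  # crude back-projection
    start = np.array([0.0, 0.0, 0.5])                              # nominal home pose
    alphas = np.linspace(0.0, 1.0, n_waypoints)[:, None]
    return start + alphas * (target - start)                       # (n_waypoints, 3)


def execute_with_low_level_policy(path_3d: np.ndarray) -> None:
    """Low-level stage (placeholder): track the planned 3D path waypoint by waypoint."""
    for i, waypoint in enumerate(path_3d):
        # A real 3D-aware policy would emit joint or end-effector commands here.
        print(f"waypoint {i}: move end-effector to {waypoint.round(3)}")


if __name__ == "__main__":
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)       # stand-in camera image
    depth = np.full((480, 640), 0.8, dtype=np.float32)  # stand-in depth map (meters)
    kps = affordance_segmentation(rgb)
    path = plan_3d_trajectory(kps, depth, instruction="pick up the cup")
    execute_with_low_level_policy(path)
```

The hierarchy is what makes the approach scalable in the abstract's framing: the high- and mid-level stages inherit open-world generalization from foundation models, while only the low-level tracking policy needs to be precise, so no real-world demonstrations are required to produce the intermediate 3D paths.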