GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Abstract

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface that converts language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, whereas manipulation benefits from a stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first bottleneck, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. We evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified. Relative to the MV-SAM3D baseline, GeoFuse-MV3D reduces CD and LPIPS by 2.20% and 2.02%, respectively, and increases PSNR and SSIM by 2.36% and 1.03%. Relative to ReasoningBank, KnowledgeBank improves Terminal-Bench SR by 4.53% and SWE-Bench resolve rate by 3.73%, while reducing AS by 4.95% and 5.65% on the two benchmarks, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
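To make the idea of governed memory with precision-oriented retrieval concrete, the following is a minimal illustrative sketch, not the released implementation. The names `MemoryItem`, `Lifecycle`, and `retrieve`, the thresholds `min_sim` and `min_conf`, and the scoring rule (similarity times quality times confidence) are all assumptions introduced here for illustration. Only the metadata categories (quality, confidence, lifecycle, verifier, conflict) come from the abstract.

```python
import math
from dataclasses import dataclass, field
from enum import Enum


class Lifecycle(Enum):
    # Hypothetical lifecycle states; the abstract names "lifecycle" only.
    CANDIDATE = "candidate"
    ACTIVE = "active"
    RETIRED = "retired"


@dataclass
class MemoryItem:
    key: str
    text: str
    embedding: list[float]
    quality: float            # assumed to lie in [0, 1]
    confidence: float         # assumed to lie in [0, 1]
    verified: bool = False    # stands in for a verifier's verdict
    lifecycle: Lifecycle = Lifecycle.ACTIVE
    conflicts_with: set[str] = field(default_factory=set)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, items, k=2, min_sim=0.5, min_conf=0.6):
    """Return at most k items, preferring precision over recall."""
    scored = []
    for item in items:
        # Hard filters: only active, verified, sufficiently confident items.
        if item.lifecycle is not Lifecycle.ACTIVE or not item.verified:
            continue
        if item.confidence < min_conf:
            continue
        sim = cosine(query, item.embedding)
        if sim < min_sim:
            continue
        scored.append((sim * item.quality * item.confidence, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    selected = []
    for _, item in scored:
        # Skip any item that conflicts with a higher-ranked selection.
        clash = any(
            item.key in kept.conflicts_with or kept.key in item.conflicts_with
            for kept in selected
        )
        if not clash:
            selected.append(item)
        if len(selected) == k:
            break
    return selected


bank = [
    MemoryItem("a", "grasp mug by handle", [1.0, 0.0], 0.9, 0.9, verified=True),
    MemoryItem("b", "grasp mug by rim", [1.0, 0.1], 0.8, 0.8, verified=True,
               conflicts_with={"a"}),
    MemoryItem("c", "open drawer slowly", [0.0, 1.0], 0.5, 0.9, verified=True),
    MemoryItem("d", "stale advice", [1.0, 0.0], 0.9, 0.9, verified=True,
               lifecycle=Lifecycle.RETIRED),
    MemoryItem("e", "unsure advice", [0.9, 0.1], 0.9, 0.3, verified=True),
]
print([m.key for m in retrieve([1.0, 0.0], bank)])  # ['a']
```

In this sketch the retriever returns fewer than `k` items when the remaining candidates are retired, unverified, low-confidence, dissimilar, or in conflict with a higher-ranked item. Returning fewer items in that case is one plausible reading of "precision-oriented", in contrast to appending and retrieving by semantic similarity alone.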