GeneralVLA-2: Robot Planning Based on Geometry-Aware Reconstruction and Governed Memory

Abstract

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface that converts language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, whereas manipulation benefits from the stable object shape that calibrated multi-view observations can provide when they are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first bottleneck, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues against input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second bottleneck, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. We evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified. Compared with the MV-SAM3D baseline, GeoFuse-MV3D reduces CD and LPIPS by 2.20% and 2.02% and increases PSNR and SSIM by 2.36% and 1.03%, respectively. Compared with ReasoningBank, KnowledgeBank improves Terminal-Bench SR by 4.53% and SWE-Bench resolve rate by 3.73%, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
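To make the idea of a governed memory with precision-oriented retrieval more concrete, the following is a minimal Python sketch. It is not the KnowledgeBank implementation: the `MemoryEntry` fields, the `retrieve` function, the thresholds, and the scoring rule (similarity times quality times confidence) are all assumptions chosen for illustration. The sketch only shows how the quality, confidence, lifecycle, verifier, and conflict metadata named in the abstract could gate what is returned, instead of ranking by semantic similarity alone.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Set


@dataclass
class MemoryEntry:
    """Hypothetical memory record; field names are illustrative assumptions."""
    entry_id: str
    text: str
    embedding: List[float]
    quality: float                      # assumed range 0..1
    confidence: float                   # assumed range 0..1
    lifecycle: str = "active"           # e.g. "candidate", "active", "deprecated"
    verified: bool = False              # set by some external verifier
    conflicts_with: Set[str] = field(default_factory=set)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query: List[float], entries: List[MemoryEntry], k: int = 2,
             min_sim: float = 0.6, min_conf: float = 0.5) -> List[MemoryEntry]:
    """Return at most k entries, preferring precision over recall.

    Entries are dropped unless they are active, verified, confident enough,
    and similar enough. Survivors are ranked by an assumed score, and an
    entry is skipped if it conflicts with one that was already selected.
    """
    scored = []
    for e in entries:
        if e.lifecycle != "active" or not e.verified:
            continue
        if e.confidence < min_conf:
            continue
        sim = cosine(query, e.embedding)
        if sim < min_sim:
            continue
        scored.append((sim * e.quality * e.confidence, e))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    chosen: List[MemoryEntry] = []
    chosen_ids: Set[str] = set()
    blocked: Set[str] = set()
    for _, e in scored:
        if e.entry_id in blocked or e.conflicts_with & chosen_ids:
            continue
        chosen.append(e)
        chosen_ids.add(e.entry_id)
        blocked |= e.conflicts_with
        if len(chosen) == k:
            break
    return chosen
```

In this sketch, hard filters run before ranking, so a deprecated or unverified entry is never returned even when it is the closest semantic match. Returning fewer than `k` entries is treated as acceptable, which is one simple way to read "precision-oriented".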