GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Abstract

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface that converts language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, whereas manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues against input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. We evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified. GeoFuse-MV3D improves over the MV-SAM3D baseline, reducing CD and LPIPS by 2.20% and 2.02% and increasing PSNR and SSIM by 2.36% and 1.03%, respectively. KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
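
To make the idea of soft visual-hull support concrete, the following is a minimal sketch and not the GeoFuse-MV3D implementation. It assumes each calibrated view is given as a 3x4 projection matrix paired with a binary object mask, and it defines the support of a 3D point as the fraction of views in which the point projects inside the mask. The names `project` and `soft_hull_support`, the data layout, and the toy cameras are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # assumed 3x4 camera projection matrix
Mask = Sequence[Sequence[int]]      # assumed binary mask, indexed mask[row][col]
Point = Tuple[float, float, float]
View = Tuple[Matrix, Mask]


def project(P: Matrix, point: Point) -> Optional[Tuple[float, float]]:
    """Project a 3D point to pixel coordinates (u, v) with a 3x4 matrix.

    Returns None when the point lies on or behind the camera plane.
    """
    hom = (point[0], point[1], point[2], 1.0)
    u, v, w = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    if w <= 0:
        return None
    return u / w, v / w


def soft_hull_support(point: Point, views: Sequence[View]) -> float:
    """Fraction of views whose mask contains the projection of `point`.

    A hard visual hull keeps a point only if every view agrees; this soft
    variant returns a score in [0, 1], so one noisy mask cannot veto a point.
    """
    if not views:
        return 0.0
    hits = 0
    for P, mask in views:
        uv = project(P, point)
        if uv is None:
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))  # nearest pixel
        inside = 0 <= row < len(mask) and 0 <= col < len(mask[0])
        if inside and mask[row][col]:
            hits += 1
    return hits / len(views)


# Toy setup (hypothetical): two cameras that share one 3x3 mask whose only
# foreground pixel is the center; the second camera is shifted along x.
center_mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
cam_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
cam_b = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
views = [(cam_a, center_mask), (cam_b, center_mask)]

support = soft_hull_support((1.0, 1.0, 1.0), views)  # inside 1 of 2 masks
```

In a reconstruction pipeline, a score of this kind could be used to down-weight geometry that few views support rather than carve it away outright. How GeoFuse-MV3D actually computes and applies its support term is not specified in this abstract.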