RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Abstract

Memory is critical for long-horizon, history-dependent robotic manipulation. Such tasks often require counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings, which hinders systematic understanding, comparison, and measurement of progress. To address these challenges, we introduce RoboMME, a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code are available at our website, https://robomme.github.io.
