AURA: Action-Gated Memory for Robot Policies at Constant VRAM

Abstract

The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of horizon, while a KV-cache grows to 6,061 times larger at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the best O(1) baseline in accuracy while using 5.19-6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. Budget-matched random and periodic schedules do not recover this gain, isolating the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous rather than a guarantee.
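The gating idea and the memory figures above can be made concrete with a small sketch. This is a toy illustration under stated assumptions, not the AURA-Mem implementation: the class `GatedMemory`, the exponential-moving-average update, the random linear action head, and the fixed `threshold` are all hypothetical stand-ins (the abstract describes a learned gate trained against a closed-loop action-error signal). The layout of 1,056 float32 values is chosen only so the toy state occupies 4,224 bytes, and the figure of 256 KV-cache bytes per step is back-solved from the reported 6,061 times ratio at 100,000 steps, assuming linear growth.

```python
import math
import random
from array import array

STATE_BYTES = 4_224            # fixed inference state size reported in the abstract
STATE_DIM = STATE_BYTES // 4   # ASSUMPTION: state held as 1,056 float32 values


def kv_to_state_ratio(steps, kv_bytes_per_step=256):
    """Size of a linearly growing cache relative to the fixed state.

    kv_bytes_per_step=256 is an ASSUMPTION back-solved from the reported
    ratio; the abstract does not state the per-step cache size.
    """
    return steps * kv_bytes_per_step / STATE_BYTES


class GatedMemory:
    """Hypothetical constant-size memory with an action-change write gate."""

    def __init__(self, n_actions=4, threshold=0.1, decay=0.9, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(STATE_DIM)
        self.state = array("f", [0.0] * STATE_DIM)  # never resized
        # Random linear head standing in for the frozen policy backbone.
        self.head = [[rng.gauss(0.0, scale) for _ in range(STATE_DIM)]
                     for _ in range(n_actions)]
        self.threshold = threshold  # fixed here; learned in the paper
        self.decay = decay
        self.writes = 0

    def _action(self, state):
        return [math.tanh(sum(w * s for w, s in zip(row, state)))
                for row in self.head]

    def step(self, obs):
        # Candidate state if this observation were written (toy EMA update).
        candidate = [self.decay * s + (1.0 - self.decay) * o
                     for s, o in zip(self.state, obs)]
        current = self._action(self.state)
        proposed = self._action(candidate)
        # "Action surprise": how far the write would move the next action.
        surprise = math.dist(current, proposed)
        if surprise > self.threshold:
            self.state[:] = array("f", candidate)  # in-place write
            self.writes += 1
            return proposed
        return current  # gate stays silent; state is untouched


def run_constant_stream(steps=100):
    """Feed the same observation repeatedly and return the write count."""
    mem = GatedMemory()
    obs = [1.0] * STATE_DIM
    for _ in range(steps):
        mem.step(obs)
    return mem.writes
```

On a stream that repeats the same observation, the toy gate stops writing once a further update would no longer move the action by more than the threshold, while the state buffer stays at 4,224 bytes throughout. Under the assumed 256 bytes per step, `kv_to_state_ratio(100_000)` evaluates to about 6,061, consistent with the ratio quoted above.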