AURA: Action-Gated Memory for Robot Policies at Constant VRAM

Abstract

The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes, rather than compute, can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. AURA-Mem's inference state is fixed at 4,224 bytes regardless of horizon, whereas a KV-cache grows to 6,061 times that size at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the accuracy of the best O(1) baseline while using 5.19 to 6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. Budget-matched random and periodic schedules do not recover this gain, which isolates the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous and does not constitute a guarantee.
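To make the gating idea concrete, the following is a minimal toy sketch, not the AURA-Mem implementation. It assumes a hand-set threshold in place of the learned gate, a simple exponential blend as the memory update, and a toy policy that reads only the memory state. The names `GatedMemory`, `threshold`, `blend`, `toy_policy`, and `run_demo` are hypothetical and chosen for illustration only. The sketch shows the two properties the abstract describes: the state never grows, and a write happens only when folding in the current observation would change the next action.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class GatedMemory:
    """Fixed-size recurrent state with an action-surprise write gate (toy sketch)."""
    size: int
    threshold: float          # hypothetical stand-in for the learned gate
    blend: float = 0.5        # hypothetical update rule: exponential blend
    state: Vector = field(init=False)
    writes: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        self.state = [0.0] * self.size

    def step(self, obs: Vector, policy: Callable[[Vector], Vector]) -> Vector:
        # Candidate state if the observation were written into memory.
        candidate = [self.blend * s + (1.0 - self.blend) * o
                     for s, o in zip(self.state, obs)]
        # Action surprise: how much would the next action change if we wrote?
        a_keep = policy(self.state)
        a_write = policy(candidate)
        surprise = max(abs(x - y) for x, y in zip(a_keep, a_write))
        if surprise > self.threshold:
            self.state = candidate   # write only when the action would change
            self.writes += 1
        return policy(self.state)


def toy_policy(memory: Vector) -> Vector:
    # Assumed toy policy: a binary action driven by the first memory slot.
    return [1.0 if memory[0] > 0.25 else 0.0]


def run_demo() -> Tuple[int, int, int, List[float]]:
    mem = GatedMemory(size=2, threshold=0.1)
    # Mostly uninformative stream with one short burst that matters.
    stream = [[0.0, 0.0]] * 5 + [[1.0, 0.0]] * 3 + [[0.0, 0.0]] * 5
    actions = [mem.step(obs, toy_policy)[0] for obs in stream]
    return mem.writes, len(stream), len(mem.state), actions


if __name__ == "__main__":
    writes, steps, state_len, actions = run_demo()
    print(f"writes={writes} steps={steps} state_len={state_len}")
    print(actions)
```

In this toy stream the gate fires only at the two steps where the action flips, so the write count tracks behaviorally relevant changes instead of the episode length, and the state length stays fixed throughout. A learned gate trained against a closed-loop action-error signal, as the abstract describes, would replace the fixed threshold used here.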