AtlasVA: A Teacher-Free VLM Agent with Self-Evolving Visual Skill Memory

Abstract

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision-making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense, visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
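
The idea of reusing danger and affinity atlases as potential-based shaping rewards can be illustrated with a minimal sketch. This is not AtlasVA's implementation: the update rule, the learning rate `lr`, the discount `GAMMA`, the choice of potential (affinity minus danger), and all function names are assumptions made for illustration. Only the general shaping form, the discounted potential of the next state minus the potential of the current state, is the standard definition of potential-based shaping.

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value, not from the paper)

def update_atlases(danger, affinity, trajectory, success, lr=0.1):
    """Hypothetical update from trajectory statistics.

    On success, raise affinity on every visited cell.
    On failure, raise danger on the terminal cell only.
    """
    if success:
        for cell in trajectory:
            affinity[cell] += lr * (1.0 - affinity[cell])
    else:
        last = trajectory[-1]
        danger[last] += lr * (1.0 - danger[last])

def potential(danger, affinity, cell):
    """Assumed potential: high where affinity is high and danger is low."""
    return float(affinity[cell] - danger[cell])

def shaping_reward(danger, affinity, s, s_next, gamma=GAMMA):
    """Potential-based shaping term: gamma * phi(s_next) - phi(s)."""
    return gamma * potential(danger, affinity, s_next) - potential(danger, affinity, s)

# Toy 4x4 grid: one successful episode and one failed episode.
danger = np.zeros((4, 4))
affinity = np.zeros((4, 4))
update_atlases(danger, affinity, [(0, 0), (0, 1), (1, 1)], success=True)
update_atlases(danger, affinity, [(0, 0), (1, 0)], success=False)

toward_goal = shaping_reward(danger, affinity, (0, 0), (0, 1))
toward_hole = shaping_reward(danger, affinity, (0, 0), (1, 0))
```

In this toy setting, a step toward a cell seen on a successful path receives a larger shaping term than a step toward a cell where an episode failed, which gives the agent a dense signal even when the environment reward is sparse.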