Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning

Abstract

Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved these inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set demonstrates improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at https://hsp-iit.github.io/epos-vlm/.
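
The abstract reports gains in caption self-similarity without defining the metric. As a rough illustration only, one plausible reading is the mean pairwise similarity among the captions produced for a single object across viewpoints. The sketch below makes that assumption and uses word-level Jaccard overlap as a stand-in for the text similarity; the function names `token_jaccard` and `caption_self_similarity` are hypothetical and are not taken from the released code, and the paper's actual metric may differ.

```python
from itertools import combinations


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two captions.

    Stand-in similarity (an assumption); the paper may use a different measure.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty captions are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


def caption_self_similarity(captions: list[str]) -> float:
    """Mean pairwise similarity among captions of one object.

    A value of 1.0 means every view produced the same set of words.
    """
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 1.0  # a single caption cannot disagree with itself
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical captions for the same object seen from three viewpoints.
consistent = ["a red wooden chair", "a red wooden chair", "a red wooden chair"]
inconsistent = ["a red wooden chair", "a brown stool", "an orange seat"]

print(caption_self_similarity(consistent))
print(caption_self_similarity(inconsistent))
```

Under this reading, an agent that keeps a persistent per-object memory should score closer to 1.0 than one that captions each view independently.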
