Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection
December 15, 2025
Authors: Juil Koo, Daehyeon Choi, Sangwoo Youn, Phillip Y. Lee, Minhyuk Sung
cs.AI
Abstract
Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.
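The abstract outlines a two-stage fine-tuning recipe: supervised fine-tuning (SFT) on automatically generated query-target view pairs, followed by RL-based policy optimization driven by downstream question answering. The sketch below is a minimal, hypothetical illustration of that recipe only; the `ViewSelectionPolicy` network, feature dimensions, discrete candidate-view set, `fake_batch` data stand-in, and `qa_reward` function are all assumptions for illustration, not the paper's actual models, dataset, or reward.

```python
# Hypothetical sketch: SFT on paired query/target views, then RL-based policy
# optimization with a question-answering-style reward. Shapes, names, and the
# reward function are placeholders, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

FEAT_DIM, NUM_CANDIDATE_VIEWS = 512, 8  # placeholder sizes


class ViewSelectionPolicy(nn.Module):
    """Scores candidate next viewpoints from current-view and question features."""

    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * FEAT_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_CANDIDATE_VIEWS),
        )

    def forward(self, view_feat, question_feat):
        return self.head(torch.cat([view_feat, question_feat], dim=-1))


policy = ViewSelectionPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)


def fake_batch(batch_size=16):
    # Stand-in for VLM-encoded (current view, question) pairs and the
    # automatically generated target-view labels from the synthetic dataset.
    view = torch.randn(batch_size, FEAT_DIM)
    question = torch.randn(batch_size, FEAT_DIM)
    target_view = torch.randint(0, NUM_CANDIDATE_VIEWS, (batch_size,))
    return view, question, target_view


# Stage 1: SFT -- imitate the ground-truth target view with cross-entropy.
for _ in range(100):
    view, question, target_view = fake_batch()
    loss = F.cross_entropy(policy(view, question), target_view)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def qa_reward(chosen_view, target_view):
    # Placeholder reward: 1 if the chosen view matches the informative target.
    # In practice the signal would come from downstream answer correctness.
    return (chosen_view == target_view).float()


# Stage 2: RL policy optimization (simple REINFORCE with a mean baseline).
for _ in range(100):
    view, question, target_view = fake_batch()
    dist = Categorical(logits=policy(view, question))
    chosen = dist.sample()
    advantage = qa_reward(chosen, target_view)
    advantage = advantage - advantage.mean()
    loss = -(dist.log_prob(chosen) * advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this toy setup the next view is drawn from a small discrete candidate set; how the paper actually parameterizes viewpoints and computes the QA reward is not specified in the abstract, so those choices here are purely illustrative.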