
Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection

December 15, 2025
作者: Juil Koo, Daehyeon Choi, Sangwoo Youn, Phillip Y. Lee, Minhyuk Sung
cs.AI

Abstract

Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question-answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based embodied question answering (EQA) systems improves downstream question-answering accuracy.
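
The abstract describes an RL stage that optimizes the viewpoint-selection policy using the paired query-target views and question-answer prompts from the synthetic dataset. The sketch below illustrates one plausible reward formulation, assuming the policy proposes a camera pose from the current image alone, a simulator renders that pose, and a frozen VQA model is checked against the ground-truth answer. All function, field, and file names here (`propose_viewpoint`, `render_view`, `answer_question`, `Sample`, etc.) are hypothetical placeholders, not the paper's actual interface.

```python
# Minimal sketch of a VG-AVS-style RL reward: the policy VLM picks the next
# viewpoint from the current image and question, and the reward is whether a
# frozen VQA model answers correctly from the newly rendered view.
# All names below are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    query_image: str   # path or ID of the current (query) view
    question: str      # question the current view cannot answer well
    gt_answer: str     # ground-truth answer, available in the synthetic dataset


def viewpoint_reward(
    sample: Sample,
    propose_viewpoint: Callable[[str, str], dict],  # policy VLM: (image, question) -> camera pose
    render_view: Callable[[dict], str],             # simulator: camera pose -> rendered image
    answer_question: Callable[[str, str], str],     # frozen VQA model: (image, question) -> answer
) -> float:
    """Binary reward: 1.0 if the VQA model answers correctly from the proposed view."""
    pose = propose_viewpoint(sample.query_image, sample.question)
    new_image = render_view(pose)
    predicted = answer_question(new_image, sample.question)
    return 1.0 if predicted.strip().lower() == sample.gt_answer.strip().lower() else 0.0


if __name__ == "__main__":
    # Dummy stand-ins to show the call pattern; replace with real models and a simulator.
    dummy = Sample("kitchen_001.png", "Is the stove on?", "yes")
    r = viewpoint_reward(
        dummy,
        propose_viewpoint=lambda img, q: {"yaw": 30.0, "pitch": 0.0, "forward": 0.5},
        render_view=lambda pose: "kitchen_001_rotated.png",
        answer_question=lambda img, q: "yes",
    )
    print(f"reward = {r}")
```

A binary answer-correctness reward like this would directly tie viewpoint selection to downstream QA accuracy, which matches the evaluation described in the abstract; the actual reward design, pose parameterization, and RL algorithm used in the paper may differ.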