PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
February 12, 2024
Authors: Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter
cs.AI
Abstract
Vision language models (VLMs) have shown impressive capabilities across a
variety of tasks, from logical reasoning to visual understanding. This opens
the door to richer interaction with the world, for example robotic control.
However, VLMs produce only textual outputs, while robotic control and other
spatial tasks require outputting continuous coordinates, actions, or
trajectories. How can we enable VLMs to handle such settings without
fine-tuning on task-specific data?
In this paper, we propose a novel visual prompting approach for VLMs that we
call Prompting with Iterative Visual Optimization (PIVOT), which casts tasks as
iterative visual question answering. In each iteration, the image is annotated
with a visual representation of proposals that the VLM can refer to (e.g.,
candidate robot actions, localizations, or trajectories). The VLM then selects
the best ones for the task. These proposals are iteratively refined, allowing
the VLM to eventually zero in on the best available answer. We investigate
PIVOT on real-world robotic navigation, real-world manipulation from images,
instruction following in simulation, and additional spatial inference tasks
such as localization. We find, perhaps surprisingly, that our approach enables
zero-shot control of robotic systems without any robot training data,
navigation in a variety of environments, and other capabilities. Although
current performance is far from perfect, our work highlights the potential and
limitations of this new regime and shows a promising approach for
Internet-scale VLMs in robotic and spatial reasoning domains. Website:
pivot-prompt.github.io and HuggingFace:
https://huggingface.co/spaces/pivot-prompt/pivot-prompt-demo.
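Since the abstract describes the loop only at a high level, the following is a minimal, hypothetical Python sketch of the iterative visual optimization it outlines. The helpers `query_vlm` and `draw_labeled_markers` are placeholders (not from the paper), and refitting a Gaussian over 2D proposals is one assumed way to realize the "iteratively refined" step; the actual method may differ in its proposal representation and refinement rule.

```python
# Hypothetical sketch of the PIVOT loop: sample visual proposals,
# annotate the image, let the VLM pick the best labels, refit, repeat.
import numpy as np

def pivot(image, task_prompt, query_vlm, draw_labeled_markers,
          n_candidates=8, n_select=3, n_iters=3):
    # Assumed: proposals are 2D points in normalized image coordinates,
    # drawn from a Gaussian that starts broad and shrinks each iteration.
    mean = np.array([0.5, 0.5])
    std = np.array([0.3, 0.3])

    for _ in range(n_iters):
        # 1. Sample candidate proposals and clip them to the image frame.
        candidates = np.clip(
            np.random.normal(mean, std, size=(n_candidates, 2)), 0.0, 1.0)

        # 2. Annotate the image with numbered markers, one per candidate
        #    (draw_labeled_markers is a placeholder drawing routine).
        annotated = draw_labeled_markers(image, candidates)

        # 3. Ask the VLM to select the labels best suited to the task.
        prompt = (f"{task_prompt}\nThe image shows candidate actions "
                  f"labeled 0 to {n_candidates - 1}. Reply with the "
                  f"{n_select} best labels, comma-separated.")
        reply = query_vlm(annotated, prompt)
        chosen = [int(tok) for tok in
                  reply.replace(" ", "").split(",")][:n_select]

        # 4. Refit the proposal distribution to the selected candidates,
        #    so later iterations zero in on the best available answer.
        selected = candidates[chosen]
        mean = selected.mean(axis=0)
        std = np.maximum(selected.std(axis=0), 1e-3)

    return mean  # the converged proposal, e.g. a target image location
```

Under these assumptions, the loop behaves like a derivative-free optimizer (in the spirit of the cross-entropy method) in which the VLM serves as the selection oracle, which is why no robot training data or task-specific fine-tuning is required.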