

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

February 12, 2024
Authors: Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter
cs.AI

Abstract

Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, for example, robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require outputting continuous coordinates, actions, or trajectories. How can we enable VLMs to handle such settings without fine-tuning on task-specific data? In this paper, we propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT), which casts tasks as iterative visual question answering. In each iteration, the image is annotated with a visual representation of proposals that the VLM can refer to (e.g., candidate robot actions, localizations, or trajectories). The VLM then selects the best ones for the task. These proposals are iteratively refined, allowing the VLM to eventually zero in on the best available answer. We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities. Although current performance is far from perfect, our work highlights the potential and limitations of this new regime and shows a promising approach for Internet-Scale VLMs in robotic and spatial reasoning domains. Website: pivot-prompt.github.io and HuggingFace: https://huggingface.co/spaces/pivot-prompt/pivot-prompt-demo.
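To make the iterative loop described above concrete, here is a minimal sketch of how such a procedure could be implemented, based only on the abstract: sample candidate 2D action points from a proposal distribution, draw them as numbered markers on the image, ask the VLM to pick the best candidates, and refit the distribution to the winners. This is an illustration under stated assumptions, not the authors' released implementation; in particular, `query_vlm` is a hypothetical stand-in for whatever VLM API is used, and the Gaussian-refitting details are one plausible choice.

```python
# Hedged sketch of a PIVOT-style loop (Prompting with Iterative Visual
# Optimization). Assumptions: actions are 2D image points, proposals come
# from a Gaussian, and `query_vlm` (hypothetical) returns the indices of
# the numbered candidates the VLM judges best for the instruction.

import numpy as np
from PIL import Image, ImageDraw


def query_vlm(image: Image.Image, instruction: str, num_labels: int) -> list[int]:
    """Hypothetical VLM call: given an annotated image and an instruction,
    return indices of the best labeled candidates. Swap in a real VLM API."""
    raise NotImplementedError


def pivot(image: Image.Image, instruction: str,
          num_iters: int = 3, num_samples: int = 8, top_k: int = 3) -> np.ndarray:
    h, w = image.height, image.width
    mean = np.array([w / 2.0, h / 2.0])                # start centered
    cov = np.diag([(w / 4.0) ** 2, (h / 4.0) ** 2])    # broad initial spread

    for _ in range(num_iters):
        # Sample candidate points and keep them inside the image bounds.
        candidates = np.random.multivariate_normal(mean, cov, size=num_samples)
        candidates = np.clip(candidates, [0, 0], [w - 1, h - 1])

        # Annotate the image with numbered markers the VLM can refer to.
        annotated = image.copy()
        draw = ImageDraw.Draw(annotated)
        for i, (x, y) in enumerate(candidates):
            draw.ellipse([x - 10, y - 10, x + 10, y + 10], outline="red", width=2)
            draw.text((x - 4, y - 6), str(i), fill="red")

        # Ask the VLM for the best proposals, then refit the distribution.
        chosen = query_vlm(annotated, instruction, num_samples)[:top_k]
        picked = candidates[chosen]
        mean = picked.mean(axis=0)
        cov = np.cov(picked.T) + 1e-3 * np.eye(2)      # keep cov non-degenerate

    return mean  # converged 2D target point (e.g., where to move or grasp)
```

The refit step resembles a cross-entropy-method update: each iteration narrows the proposal distribution around the candidates the VLM preferred, which is how the textual-output-only model ends up producing a continuous coordinate.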