PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Abstract

Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, such as robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require outputting continuous coordinates, actions, or trajectories. How can we enable VLMs to handle such settings without fine-tuning on task-specific data? In this paper, we propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT), which casts tasks as iterative visual question answering. In each iteration, the image is annotated with a visual representation of proposals that the VLM can refer to (e.g., candidate robot actions, localizations, or trajectories). The VLM then selects the best ones for the task. These proposals are iteratively refined, allowing the VLM to eventually zero in on the best available answer. We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities. Although current performance is far from perfect, our work highlights the potential and limitations of this new regime and shows a promising approach for Internet-scale VLMs in robotic and spatial reasoning domains. Website: pivot-prompt.github.io and HuggingFace: https://huggingface.co/spaces/pivot-prompt/pivot-prompt-demo.
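The propose, select, and refine loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`iterative_select`, `make_stub_selector`), the 2D point representation, and the refinement rule (refitting a Gaussian to the selected candidates) are all assumptions made for illustration. The image annotation and VLM query are replaced by a hypothetical callable, `select_best`.

```python
import random
import statistics
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def iterative_select(
    select_best: Callable[[List[Point]], List[Point]],
    mean: Point = (0.5, 0.5),
    std: Point = (0.3, 0.3),
    num_candidates: int = 10,
    num_iterations: int = 5,
    seed: int = 0,
) -> Point:
    """Sample candidates, let a selector pick some, refit, and repeat.

    `select_best` is a hypothetical stand-in for the step in which the
    image is annotated with the candidates and the VLM chooses among them.
    It must return at least one candidate.
    """
    rng = random.Random(seed)
    for _ in range(num_iterations):
        # Propose candidates around the current estimate.
        candidates = [
            (rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
            for _ in range(num_candidates)
        ]
        # Selection step (assumed interface, not a real VLM API).
        chosen = select_best(candidates)
        xs = [p[0] for p in chosen]
        ys = [p[1] for p in chosen]
        # Refinement step (assumed rule): refit the sampling distribution
        # to the chosen candidates, with a small floor on the spread.
        mean = (statistics.fmean(xs), statistics.fmean(ys))
        std = (
            max(statistics.pstdev(xs), 1e-3),
            max(statistics.pstdev(ys), 1e-3),
        )
    return mean


def make_stub_selector(target: Point, keep: int = 3):
    """Hypothetical selector: keeps the candidates nearest a known target."""
    def select(cands: List[Point]) -> List[Point]:
        def sq_dist(p: Point) -> float:
            return (p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2
        return sorted(cands, key=sq_dist)[:keep]
    return select


if __name__ == "__main__":
    estimate = iterative_select(make_stub_selector((0.8, 0.2)))
    print(estimate)
```

In this sketch the stub selector knows the target, which a real system would not. It only serves to exercise the loop, so the output says nothing about how well a VLM would perform the selection.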
