

Toward Grounded Social Reasoning

June 14, 2023
作者: Minae Kwon, Hengyuan Hu, Vivek Myers, Siddharth Karamcheti, Anca Dragan, Dorsa Sadigh
cs.AI

Abstract

Consider a robot tasked with tidying a desk with a meticulously constructed Lego sports car. A human may recognize that it is not socially appropriate to disassemble the sports car and put it away as part of the "tidying". How can a robot reach that conclusion? Although large language models (LLMs) have recently been used to enable social reasoning, grounding this reasoning in the real world has been challenging. To reason in the real world, robots must go beyond passively querying LLMs and *actively gather information from the environment* that is required to make the right decision. For instance, after detecting that there is an occluded car, the robot may need to actively perceive the car to know whether it is an advanced model car made out of Legos or a toy car built by a toddler. We propose an approach that leverages an LLM and vision language model (VLM) to help a robot actively perceive its environment to perform grounded social reasoning. To evaluate our framework at scale, we release the MessySurfaces dataset which contains images of 70 real-world surfaces that need to be cleaned. We additionally illustrate our approach with a robot on 2 carefully designed surfaces. We find an average 12.9% improvement on the MessySurfaces benchmark and an average 15% improvement on the robot experiments over baselines that do not use active perception. The dataset, code, and videos of our approach can be found at https://minaek.github.io/groundedsocialreasoning.
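The abstract describes a pipeline in which an LLM decides what information is missing, the robot actively perceives to gather it (e.g., a close-up of the occluded car), and a VLM answers grounded questions from the new observation. A minimal sketch of that loop is below; all function names (`query_llm`, `query_vlm`, `capture_closeup`) and the canned responses are illustrative stand-ins, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the active-perception loop described above:
# (1) the LLM proposes a clarifying question about an object,
# (2) the robot captures a close-up and a VLM answers the question,
# (3) the LLM uses the grounded answer to pick a socially
#     appropriate tidying action.

def query_llm(prompt: str) -> str:
    """Stand-in for an LLM call; a real system would query an API here."""
    if "What question" in prompt:
        return "Is it a carefully built model or a casual toy?"
    return "Leave it in place; ask the owner before disassembling."

def capture_closeup(obj_name: str):
    """Stand-in for moving the robot camera to image the object."""
    return None  # placeholder for an image

def query_vlm(image, question: str) -> str:
    """Stand-in for a VLM answering a question about an image."""
    return "A carefully built Lego model."

def decide_action(obj_name: str, scene_description: str) -> str:
    # 1. Ask the LLM what it still needs to know about the object.
    question = query_llm(
        f"Scene: {scene_description}\n"
        f"What question about '{obj_name}' would help decide "
        f"how to tidy it appropriately?"
    )
    # 2. Actively perceive: close-up image, answered by the VLM.
    image = capture_closeup(obj_name)
    answer = query_vlm(image, question)
    # 3. Ground the final decision in the gathered information.
    return query_llm(
        f"Object: {obj_name}. Q: {question} A: {answer}\n"
        f"What is the socially appropriate way to tidy this object?"
    )
```

With the canned stubs, `decide_action("Lego sports car", "a tidy desk")` returns the conservative action rather than "disassemble", illustrating how the extra perception step changes the decision.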