Toward Grounded Social Reasoning

Abstract

Consider a robot tasked with tidying a desk that holds a meticulously constructed Lego sports car. A human may recognize that it is not socially appropriate to disassemble the sports car and put it away as part of the "tidying". How can a robot reach that conclusion? Although large language models (LLMs) have recently been used to enable social reasoning, grounding this reasoning in the real world has been challenging. To reason in the real world, robots must go beyond passively querying LLMs and *actively gather information from the environment* that is required to make the right decision. For instance, after detecting an occluded car, the robot may need to actively perceive it to learn whether it is an advanced model car made out of Legos or a toy car built by a toddler. We propose an approach that leverages an LLM and a vision language model (VLM) to help a robot actively perceive its environment and perform grounded social reasoning. To evaluate our framework at scale, we release the MessySurfaces dataset, which contains images of 70 real-world surfaces that need to be cleaned. We additionally illustrate our approach with a robot on 2 carefully designed surfaces. Compared with baselines that do not use active perception, we find an average 12.9% improvement on the MessySurfaces benchmark and an average 15% improvement on the robot experiments. The dataset, code, and videos of our approach can be found at https://minaek.github.io/groundedsocialreasoning.
