Toward Grounded Social Reasoning

Abstract

Consider a robot tasked with tidying a desk that holds a meticulously constructed Lego sports car. A human may recognize that it is not socially appropriate to disassemble the sports car and put it away as part of the "tidying." How can a robot reach that conclusion? Although large language models (LLMs) have recently been used to enable social reasoning, grounding this reasoning in the real world has been challenging. To reason in the real world, robots must go beyond passively querying LLMs and *actively gather information from the environment* that is required to make the right decision. For instance, after detecting that there is an occluded car, the robot may need to actively perceive the car to know whether it is an advanced model car made out of Legos or a toy car built by a toddler. We propose an approach that leverages an LLM and a vision-language model (VLM) to help a robot actively perceive its environment and perform grounded social reasoning. To evaluate our framework at scale, we release the MessySurfaces dataset, which contains images of 70 real-world surfaces that need to be cleaned. We additionally illustrate our approach with a robot on 2 carefully designed surfaces. We find an average 12.9% improvement on the MessySurfaces benchmark and an average 15% improvement on the robot experiments over baselines that do not use active perception. The dataset, code, and videos of our approach can be found at https://minaek.github.io/groundedsocialreasoning.
