Toward Grounded Social Reasoning

Abstract

Consider a robot tasked with tidying a desk that holds a meticulously constructed Lego sports car. A human would likely recognize that it is not socially appropriate to disassemble the sports car and put it away as part of the "tidying". How can a robot reach that conclusion? Although large language models (LLMs) have recently been used to enable social reasoning, grounding this reasoning in the real world has been challenging. To reason in the real world, robots must go beyond passively querying LLMs and *actively gather information from the environment* that is required to make the right decision. For instance, after detecting that there is an occluded car, the robot may need to actively perceive the car to determine whether it is an advanced model car made of Lego or a toy car built by a toddler. We propose an approach that leverages an LLM and a vision language model (VLM) to help a robot actively perceive its environment and perform grounded social reasoning. To evaluate our framework at scale, we release the MessySurfaces dataset, which contains images of 70 real-world surfaces that need to be cleaned. We additionally illustrate our approach with a robot on two carefully designed surfaces. We find an average 12.9% improvement on the MessySurfaces benchmark and an average 15% improvement in the robot experiments over baselines that do not use active perception. The dataset, code, and videos of our approach can be found at https://minaek.github.io/groundedsocialreasoning.
