Reasoning in Space via Grounding in the World

Abstract

In this paper, we argue that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore effective spatial representations that bridge the gap between the two. Existing 3D LLMs lack a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency manifests either as poor grounding performance or as excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM to achieve autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for the objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.
