AlloSpatial: An Agentic Framework for Spatial Reasoning in Foundation Models

Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet they remain fragile at spatial reasoning about the physical world. A key bottleneck is their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees (ASTs) and route maps that support queries about object topology, geometric relations, passability, and trajectories. To use these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5% to 18% in a training-free setting, and that ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents also outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
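
As an illustration only, the geometric step behind turning egocentric observations into an allocentric representation can be sketched as a rigid-body transform from the camera frame into a shared world frame. The sketch below assumes known 2D camera poses and noise-free measurements; it is not the World2Mind implementation, and `Pose2D`, `ego_to_allo`, and the example coordinates are hypothetical names and values introduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    """Hypothetical camera pose expressed in the world (allocentric) frame."""
    x: float
    y: float
    yaw: float  # heading in radians, counter-clockwise from the world +x axis


def ego_to_allo(pose: Pose2D, forward: float, left: float) -> tuple[float, float]:
    """Map a point given in the camera frame (forward, left) to world (x, y).

    The camera's forward axis is (cos yaw, sin yaw) in the world frame and
    its left axis is (-sin yaw, cos yaw), so this is a rotation by yaw
    followed by a translation to the camera position.
    """
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    return (pose.x + c * forward - s * left,
            pose.y + s * forward + c * left)


# Two objects observed from two different viewpoints (made-up values).
pose_a = Pose2D(x=0.0, y=0.0, yaw=0.0)           # facing world +x
pose_b = Pose2D(x=4.0, y=0.0, yaw=math.pi / 2)   # facing world +y

chair = ego_to_allo(pose_a, forward=2.0, left=1.0)
table = ego_to_allo(pose_b, forward=1.0, left=1.0)

# Once both objects share one frame, viewpoint-independent relations such as
# their distance can be computed directly.
chair_table_distance = math.dist(chair, table)
```

Once every observation is expressed in a single frame, relations between objects that were never seen together in one view become simple geometric queries, which is the kind of question an allocentric prior is meant to answer.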