Reinforcing Dual-Path Reasoning in Spatial Vision-Language Models

Abstract

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning that requires multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which first detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens and then performs explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs chain-of-thought supervision for both LOR and DTR and exposes a region-to-3D interface. A reinforcement learning (RL) stage then optimizes the policy model with accuracy and format rewards; for DTR, an additional discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines, and our experiments show that (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
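To make the reward design concrete, the following is a minimal Python sketch of how accuracy, format, and discrete center-based detection rewards might be combined for a single rollout. It is an illustration under stated assumptions, not the SR-REAL implementation: the `<think>`/`<answer>` output template, the 10% numeric tolerance, the distance thresholds (assumed to be in metres), the unweighted sum, and all function names are hypothetical, since the abstract does not specify them. The predicted 3D center is assumed to have already been decoded from the model's region tokens.

```python
import math
import re

# Hypothetical output template; the abstract does not specify the actual tags.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed think/answer template."""
    return 1.0 if FORMAT_RE.match(response.strip()) else 0.0


def accuracy_reward(response: str, target: str, rel_tol: float = 0.1) -> float:
    """Return 1.0 for a correct final answer, else 0.0.

    Numeric answers are accepted within an assumed relative tolerance;
    all other answers are compared as case-insensitive strings.
    """
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    pred = match.group(1).strip()
    try:
        pred_val, target_val = float(pred), float(target)
    except ValueError:
        return 1.0 if pred.lower() == target.strip().lower() else 0.0
    return 1.0 if abs(pred_val - target_val) <= rel_tol * abs(target_val) else 0.0


def center_detection_reward(pred_center, gt_center, thresholds=(0.25, 0.5, 1.0)):
    """Discrete reward: fraction of (assumed) distance thresholds satisfied
    by the Euclidean error between predicted and ground-truth 3D centers."""
    dist = math.dist(pred_center, gt_center)
    return sum(dist <= t for t in thresholds) / len(thresholds)


def total_reward(response, target, path, pred_center=None, gt_center=None):
    """Sum accuracy and format rewards; add the detection reward for DTR only."""
    reward = accuracy_reward(response, target) + format_reward(response)
    if path == "DTR" and pred_center is not None and gt_center is not None:
        reward += center_detection_reward(pred_center, gt_center)
    return reward


if __name__ == "__main__":
    response = "<think>The chair is about 2 m away.</think><answer>2.0</answer>"
    print(total_reward(response, "2.1", path="LOR"))
    print(total_reward(response, "2.1", path="DTR",
                       pred_center=(0.0, 0.0, 0.0), gt_center=(0.3, 0.0, 0.0)))
```

In this sketch an LOR rollout is scored only on answer correctness and format, while a DTR rollout receives an extra stepwise term that grows as the predicted center approaches the ground truth. A stepwise score is one simple way to realize a "discrete" detection reward; the paper may define it differently.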