SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards

Abstract

DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recent work in the multimodal domain has begun to apply RL directly to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks differ intrinsically from textual tasks in that solving them relies heavily on understanding the input image. As a result, such free-form reasoning faces two critical limitations in VQA: (1) extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy; and (2) unverifiable intermediate steps amplify policy-gradient variance and computational overhead. To address these issues, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement Learning), which decomposes VQA into three verifiable stages: global image captioning, region localization, and answer prediction, each supplying an explicit reward signal. We also introduce VQA-Verify, a 12k dataset annotated with answer-aligned captions and bounding boxes to facilitate training. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, with up to a 15.7% improvement in accuracy over the R1-like baseline. Our analysis of the attention maps confirms enhanced focus on critical regions, which accounts for the accuracy gains. Our code is available at https://github.com/justairr/SATORI-R1.
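To make the three-stage decomposition concrete, the following is a minimal sketch of how a per-stage verifiable reward could be computed against annotations of the kind VQA-Verify provides (a caption, a bounding box, and an answer). Everything in it is an assumption for illustration: the names `StageOutput`, `token_f1`, `iou`, and `staged_reward`, the token-overlap caption score, the exact-match answer check, and the equal default weights are hypothetical and are not taken from the SATORI code or paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class StageOutput:
    """One caption, one region, and one answer (predicted or annotated)."""
    caption: str
    box: Box
    answer: str


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1, used here as a stand-in caption score (assumption)."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def staged_reward(pred: StageOutput, ref: StageOutput,
                  weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of caption, localization, and answer rewards."""
    r_caption = token_f1(pred.caption, ref.caption)
    r_box = iou(pred.box, ref.box)
    # Exact match after normalization; an assumed, deliberately simple check.
    r_answer = float(pred.answer.strip().lower() == ref.answer.strip().lower())
    w_c, w_b, w_a = weights
    return w_c * r_caption + w_b * r_box + w_a * r_answer
```

Because each term is checked against an annotation rather than judged from free-form text, every intermediate stage contributes a signal that can be verified independently of the final answer, which is the property the abstract contrasts with unverifiable reasoning chains.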
