Asking like Socrates: Socrates helps VLMs understand remote sensing images

November 27, 2025
Authors: Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li
cs.AI

Abstract

Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates
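
The alternating cycle of reasoning and visual inspection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`evidence_seeking_loop`, `ask`, `inspect`, `answer`), the stopping rule, and the toy scene are all assumptions made for illustration, and plain Python functions stand in for the models that would play these roles.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type: a trace of (question, observation) pairs.
Evidence = List[Tuple[str, str]]


def evidence_seeking_loop(
    ask: Callable[[Evidence], Optional[str]],
    inspect: Callable[[str], str],
    answer: Callable[[Evidence], str],
    max_rounds: int = 5,
) -> Tuple[str, Evidence]:
    """Alternate a reasoning step (ask) with a visual inspection step (inspect)
    until the reasoner has no further question or the round budget runs out."""
    evidence: Evidence = []
    for _ in range(max_rounds):
        question = ask(evidence)  # reasoning: decide what to look for next
        if question is None:      # reasoner judges the evidence sufficient
            break
        evidence.append((question, inspect(question)))  # inspection: gather evidence
    return answer(evidence), evidence


# Toy stand-in for an image: what an inspector would report for each question.
scene = {"Is there a runway?": "yes", "Are aircraft parked nearby?": "yes"}
questions = list(scene)


def ask(ev: Evidence) -> Optional[str]:
    # Assumed policy: walk through a fixed question list, then stop.
    return questions[len(ev)] if len(ev) < len(questions) else None


def inspect(question: str) -> str:
    return scene.get(question, "unclear")


def answer(ev: Evidence) -> str:
    # The final answer depends only on collected observations, not on a first glance.
    return "airport" if ev and all(obs == "yes" for _, obs in ev) else "unknown"


label, trace = evidence_seeking_loop(ask, inspect, answer)
```

In this sketch the final answer is computed only from the collected `(question, observation)` pairs, which is the property the abstract contrasts with reasoning from a single coarse perception.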