Artemis: Structured Visual Reasoning for Perception Policy Learning
December 1, 2025
Authors: Wei Tang, Yanpeng Sun, Shan Zhang, Xiaofan Li, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li
cs.AI
Abstract
Recent reinforcement-learning frameworks for visual perception policies have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: these chains perform semantic reasoning in an unstructured linguistic space, whereas visual perception requires reasoning in a spatial, object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, in which each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, allows direct supervision of proposal quality, and avoids the ambiguity introduced by language-based reasoning. Built on Qwen2.5-VL-3B, Artemis achieves strong performance on grounding and detection tasks and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.
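
To make the idea of a verifiable intermediate step concrete, the following is a minimal Python sketch of a (label, bounding-box) proposal and one plausible way to score proposal quality against reference boxes. It is an illustration only, not the Artemis implementation: the names `Proposal`, `iou`, and `proposal_quality`, the `(x1, y1, x2, y2)` box convention, and the use of label-matched intersection-over-union as the quality signal are all assumptions that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed convention: (x1, y1, x2, y2)


@dataclass(frozen=True)
class Proposal:
    """One intermediate reasoning step: a (label, bounding-box) pair."""
    label: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_quality(proposals: Sequence[Proposal],
                     references: Sequence[Proposal]) -> float:
    """Hypothetical quality score: for each proposal, take its best IoU
    against reference boxes with the same label, then average."""
    if not proposals:
        return 0.0
    total = 0.0
    for p in proposals:
        total += max(
            (iou(p.box, r.box) for r in references if r.label == p.label),
            default=0.0,
        )
    return total / len(proposals)


# Usage: two intermediate steps checked against annotated objects.
steps = [Proposal("dog", (0, 0, 2, 2)), Proposal("ball", (5, 5, 6, 6))]
truth = [Proposal("dog", (1, 1, 3, 3)), Proposal("ball", (5, 5, 6, 6))]
score = proposal_quality(steps, truth)
```

Because each step is a labeled box rather than free text, it can be compared directly with annotations, which is what makes the intermediate state checkable in this sketch.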