Artemis: Structured Visual Reasoning for Perception Policy Learning

Abstract

Recent reinforcement-learning frameworks for visual perception policies have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: these chains perform semantic reasoning in an unstructured linguistic space, whereas visual perception requires reasoning in a spatial, object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, in which each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, allows direct supervision of proposal quality, and avoids the ambiguity introduced by language-based reasoning. Built on Qwen2.5-VL-3B, Artemis achieves strong performance on grounding and detection tasks and generalizes substantially to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.
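To make the idea of a verifiable (label, bounding-box) reasoning step concrete, the following is a minimal illustrative sketch and not the implementation used in Artemis. The names `Proposal`, `iou`, and `proposal_quality` are hypothetical, boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, and scoring by mean best-match IoU is an assumption chosen only to show how proposal quality could be checked directly against reference boxes.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Proposal:
    """One intermediate reasoning step: a label paired with a bounding box."""
    label: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_quality(proposals: List[Proposal], references: List[Proposal]) -> float:
    """Hypothetical quality score: for each reference, take the best IoU among
    proposals with the same label, then average over all references."""
    if not references:
        return 0.0
    total = 0.0
    for ref in references:
        total += max(
            (iou(p.box, ref.box) for p in proposals if p.label == ref.label),
            default=0.0,
        )
    return total / len(references)


# Usage: two proposed steps scored against one reference object.
steps = [Proposal("dog", (0, 0, 10, 10)), Proposal("cat", (20, 20, 30, 30))]
truth = [Proposal("dog", (0, 0, 10, 10))]
score = proposal_quality(steps, truth)
```

Because each step is a structured pair and not free-form text, a score like this can be computed per step without parsing natural language, which is the property the abstract refers to as direct supervision of proposal quality.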