DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
May 29, 2025
Authors: Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren
cs.AI
Abstract
Recent explosive interest in the reasoning capabilities of large language
models such as DeepSeek-R1 has been fueled by the remarkable success of
reinforcement learning-based fine-tuning frameworks, exemplified by
Group Relative Policy Optimization (GRPO). However, such reasoning
abilities remain underexplored and notably absent in vision foundation models,
including representation models like the DINO series. In this work, we propose
DINO-R1, the first attempt to incentivize visual in-context
reasoning capabilities of vision foundation models using reinforcement
learning. Specifically, DINO-R1 introduces Group Relative Query
Optimization (GRQO), a novel reinforcement-style training strategy explicitly
designed for query-based representation models, which computes query-level
rewards based on group-normalized alignment quality. We also apply
KL-regularization to stabilize the objectness distribution, reducing
training instability. This joint optimization enables dense and expressive
supervision across queries while mitigating overfitting and distributional
drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models
that integrate a visual prompt encoder and a visual-guided query selection
mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that
DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving
strong generalization in both open-vocabulary and closed-set visual prompting
scenarios.
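The core GRQO ideas named in the abstract can be sketched in a few lines. This is a minimal illustration only, assuming a GRPO-style group normalization of per-query rewards and a discrete objectness distribution; the helper names (`group_relative_rewards`, `kl_regularizer`) and the exact reward definition are hypothetical, not taken from the paper:

```python
import numpy as np

def group_relative_rewards(alignment_scores, eps=1e-6):
    """GRPO-style group normalization: each query's alignment score is
    turned into an advantage relative to the group mean and std."""
    r = np.asarray(alignment_scores, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def kl_regularizer(p_objectness, q_reference, eps=1e-8):
    """KL(p || q) between the current and a reference objectness
    distribution, used as a stabilizing penalty during training."""
    p = np.asarray(p_objectness, dtype=float) + eps
    q = np.asarray(q_reference, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Example: four queries in one group; advantages are zero-mean,
# and the KL penalty grows as objectness drifts from the reference.
advantages = group_relative_rewards([0.9, 0.2, 0.5, 0.4])
penalty = kl_regularizer([0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25])
```

Group normalization gives every query a dense, relative learning signal without a learned value function, while the KL term keeps the objectness distribution close to a reference to limit distributional drift.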