DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
May 29, 2025
Authors: Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren
cs.AI
Abstract
Recent explosive interest in the reasoning capabilities of large language
models such as DeepSeek-R1 has been fueled by the remarkable success of
reinforcement learning-based fine-tuning frameworks, exemplified by
Group Relative Policy Optimization (GRPO). However, such reasoning
abilities remain underexplored and notably absent in vision foundation models,
including representation models like the DINO series. In this work, we propose
DINO-R1, the first attempt to incentivize visual in-context
reasoning capabilities of vision foundation models using reinforcement
learning. Specifically, DINO-R1 introduces Group Relative Query
Optimization (GRQO), a novel reinforcement-style training strategy explicitly
designed for query-based representation models, which computes query-level
rewards based on group-normalized alignment quality. We also apply
KL-regularization to stabilize the objectness distribution, reducing
training instability. This joint optimization enables dense and expressive
supervision across queries while mitigating overfitting and distributional
drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models
that integrate a visual prompt encoder and a visual-guided query selection
mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that
DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving
strong generalization in both open-vocabulary and closed-set visual prompting
scenarios.
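The core GRQO ideas named in the abstract can be sketched in a few lines. This is a minimal illustration only, assuming a GRPO-style group normalization of per-query rewards and a discrete objectness distribution; the helper names (`group_relative_rewards`, `kl_regularizer`) and the exact reward definition are hypothetical, not taken from the paper:

```python
import numpy as np

def group_relative_rewards(alignment_scores, eps=1e-6):
    """GRPO-style group normalization: each query's alignment score is
    turned into an advantage relative to the group mean and std."""
    r = np.asarray(alignment_scores, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def kl_regularizer(p_objectness, q_reference, eps=1e-8):
    """KL(p || q) between the current and a reference objectness
    distribution, used as a stabilizing penalty during training."""
    p = np.asarray(p_objectness, dtype=float) + eps
    q = np.asarray(q_reference, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Example: four queries in one group; advantages are zero-mean,
# and the KL penalty grows as objectness drifts from the reference.
advantages = group_relative_rewards([0.9, 0.2, 0.5, 0.4])
penalty = kl_regularizer([0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25])
```

Group normalization gives every query a dense, relative learning signal without a learned value function, while the KL term keeps the objectness distribution close to a reference to limit distributional drift.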