

DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

May 29, 2025
Authors: Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren
cs.AI

Abstract

The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL-regularization to stabilize the objectness distribution and reduce training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.
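
The abstract describes GRQO only at a high level, naming two ingredients: query-level rewards normalized within a group, and a KL term that keeps the objectness distribution stable. The snippet below is a minimal, hypothetical PyTorch-style sketch of just those two ideas; the function name, the use of per-query alignment scores as the reward signal, the sigmoid objectness surrogate, and the loss weighting are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the two GRQO ingredients named in the abstract:
# (1) query-level rewards normalized across the query group, and
# (2) a KL penalty keeping the objectness distribution close to a frozen reference.
# Names, reward definition, and weighting are assumptions for illustration only.
import torch
import torch.nn.functional as F

def grqo_style_loss(query_logits, ref_logits, alignment_scores, kl_weight=0.1, eps=1e-6):
    # alignment_scores: [num_queries] raw per-query alignment quality (assumed given,
    # e.g., from matching queries against ground-truth boxes).
    # Group-normalize rewards: subtract the group mean and divide by the group std.
    rewards = (alignment_scores - alignment_scores.mean()) / (alignment_scores.std() + eps)

    # Reward-weighted objective on each query's objectness log-probability,
    # so above-average queries are reinforced (illustrative surrogate).
    obj_logprob = F.logsigmoid(query_logits)                  # [num_queries]
    reward_term = -(rewards.detach() * obj_logprob).mean()

    # Bernoulli KL between the frozen reference objectness and the current one,
    # regularizing against distributional drift during training.
    p_ref = torch.sigmoid(ref_logits).detach()
    p_cur = torch.sigmoid(query_logits)
    kl = (p_ref * (p_ref / (p_cur + eps)).clamp_min(eps).log()
          + (1 - p_ref) * ((1 - p_ref) / (1 - p_cur + eps)).clamp_min(eps).log()).mean()

    return reward_term + kl_weight * kl

# Example usage with dummy tensors for a group of 900 queries (illustrative only).
logits = torch.randn(900, requires_grad=True)
ref = torch.randn(900)
align = torch.rand(900)
loss = grqo_style_loss(logits, ref, align)
loss.backward()
```

In an actual training loop, the alignment scores would come from the model's matching against annotations and the reference logits from a frozen copy of the model; both are stand-ins here.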