Visual-RFT: Visual Reinforcement Fine-Tuning
March 3, 2025
Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1
learns from feedback on its answers, which is especially useful in applications
where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1
demonstrates that reinforcement learning with verifiable reward is one key
direction in reproducing o1. While the R1-style model has demonstrated success
in language models, its application in multi-modal domains remains
under-explored. This work introduces Visual Reinforcement Fine-Tuning
(Visual-RFT), which further extends the application areas of RFT on visual
tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs)
to generate multiple responses containing reasoning tokens and final answers
for each input, and then uses our proposed visual perception verifiable reward
functions to update the model via a policy optimization algorithm such as
Group Relative Policy Optimization (GRPO). We design different verifiable
reward functions for different perception tasks, such as the Intersection over
Union (IoU) reward for object detection. Experimental results on fine-grained
image classification, few-shot object detection, reasoning grounding, as well
as open-vocabulary object detection benchmarks show the competitive performance
and advanced generalization ability of Visual-RFT compared with Supervised
Fine-tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over
the baseline in one-shot fine-grained image classification with around 100
samples. In few-shot object detection, Visual-RFT also exceeds the baseline by
21.9 on COCO's two-shot setting and 15.4 on LVIS. Our Visual-RFT represents
a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven
approach that enhances reasoning and adaptability for domain-specific tasks.Summary
AI-Generated Summary
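
The recipe described in the abstract — sample several candidate responses per visual input, score each with a task-specific verifiable reward (e.g., IoU for object detection), and update the policy using GRPO's group-relative advantages — can be illustrated with a minimal sketch. The code below is an assumption-based illustration, not the authors' implementation: the function names (iou_reward, group_relative_advantages) and the exact reward formulation (mean best-match IoU, with no format or confidence terms) are hypothetical.

```python
# Minimal sketch of a verifiable IoU reward and GRPO-style group-relative
# advantages, as described in the abstract. Names and reward details are
# illustrative assumptions, not the paper's actual code.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(pred: Box, gt: Box) -> float:
    """Intersection over Union between one predicted and one ground-truth box."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2, y2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred_boxes: List[Box], gt_boxes: List[Box]) -> float:
    """Hypothetical verifiable detection reward: mean best-match IoU of the
    predicted boxes against the ground-truth boxes."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    return sum(max(iou(p, g) for g in gt_boxes) for p in pred_boxes) / len(pred_boxes)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each sampled response's reward by the
    mean and standard deviation of its group, avoiding a learned value function."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Rewards for a group of sampled responses to one image/query.
    gt = [(10.0, 10.0, 50.0, 50.0)]
    sampled_predictions = [
        [(12.0, 11.0, 48.0, 49.0)],  # close to the ground truth
        [(0.0, 0.0, 20.0, 20.0)],    # poor localization
        [(30.0, 30.0, 80.0, 80.0)],  # partial overlap
    ]
    rewards = [iou_reward(p, gt) for p in sampled_predictions]
    print(rewards, group_relative_advantages(rewards))
```

In this sketch the advantages come purely from comparing sampled responses within a group, which is what lets a verifiable reward signal replace per-sample supervision when labeled data is scarce.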