CLS-RL: Image Classification with Rule-Based Reinforcement Learning
March 20, 2025
Authors: Ming Li, Shitian Zhao, Jike Zhong, Yuxiang Lai, Kaipeng Zhang
cs.AI
Abstract
Classification is a core task in machine learning. Recent research has shown
that although Multimodal Large Language Models (MLLMs) are initially poor at
image classification, fine-tuning them with an adequate amount of data can
significantly enhance their performance, making them comparable to SOTA
classification models. However, acquiring large-scale labeled data is
expensive. In this paper, we explore few-shot MLLM classification fine-tuning.
We found that SFT can cause severe overfitting and may even degrade
performance relative to the zero-shot approach. To address this challenge,
inspired by the recent successes of rule-based reinforcement learning, we
propose CLS-RL, which uses verifiable signals as rewards to fine-tune MLLMs.
We discovered that CLS-RL outperforms SFT on most datasets and achieves much
higher average accuracy in both the base-to-new and few-shot learning
settings. Moreover, we observed a
free-lunch phenomenon for CLS-RL; when models are fine-tuned on a particular
dataset, their performance on other distinct datasets may also improve over
zero-shot models, even if those datasets differ in distribution and class
names. This suggests that RL-based methods effectively teach models the
fundamentals of classification. Lastly, inspired by recent works in inference
time thinking, we re-examine the 'thinking process' during fine-tuning, a
critical aspect of RL-based methods, in the context of visual classification.
We question whether such tasks require an extensive thinking process during
fine-tuning, proposing that it may actually detract from performance. Based
on this premise, we introduce the No-Thinking-CLS-RL method, which minimizes
thinking processes during training by setting an equality accuracy reward. Our
findings indicate that, with much less fine-tuning time, the No-Thinking-CLS-RL
method achieves better in-domain performance and generalization than CLS-RL.
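To make the reward design concrete, here is a minimal sketch of what a rule-based verifiable reward for classification could look like, in the spirit the abstract describes: a format reward for producing an extractable answer, plus an "equality" accuracy reward that pays out only on an exact match with the ground-truth class name. The tag format, function names, and reward values are illustrative assumptions, not the authors' implementation.

```python
import re

def format_reward(response: str) -> float:
    # Hypothetical format rule: reward 1.0 if the model wraps its final
    # answer in <answer>...</answer> tags, making it verifiable by a rule.
    return 1.0 if re.search(r"<answer>.*?</answer>", response, re.S) else 0.0

def accuracy_reward(response: str, label: str) -> float:
    # "Equality" accuracy reward: 1.0 only if the extracted answer exactly
    # matches the ground-truth class name (case-insensitive), else 0.0.
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if not m:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == label.strip().lower() else 0.0

def total_reward(response: str, label: str) -> float:
    # Combined verifiable signal used to score a sampled response.
    return format_reward(response) + accuracy_reward(response, label)
```

Because such a reward is computed purely by string rules against the label, no learned reward model is needed; this is what makes the signal "verifiable" and cheap to apply in few-shot fine-tuning.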