CLS-RL: Image Classification with Rule-Based Reinforcement Learning

Abstract

Classification is a core task in machine learning. Recent research has shown that although Multimodal Large Language Models (MLLMs) are initially poor at image classification, fine-tuning them with an adequate amount of data can significantly improve their performance, making them comparable to state-of-the-art (SOTA) classification models. However, acquiring large-scale labeled data is expensive. In this paper, we explore few-shot MLLM classification fine-tuning. We find that supervised fine-tuning (SFT) can cause severe overfitting and may even perform worse than the zero-shot approach. To address this challenge, inspired by recent successes in rule-based reinforcement learning, we propose CLS-RL, which uses verifiable signals as rewards to fine-tune MLLMs. We find that CLS-RL outperforms SFT on most datasets and achieves a much higher average accuracy in both the base-to-new and few-shot learning settings. Moreover, we observe a free-lunch phenomenon for CLS-RL: when a model is fine-tuned on one dataset, its performance on other datasets may also improve over the zero-shot model, even when those datasets differ in distribution and class names. This suggests that RL-based methods effectively teach models the fundamentals of classification. Lastly, inspired by recent work on inference-time thinking, we re-examine the 'thinking process' during fine-tuning, a critical aspect of RL-based methods, in the context of visual classification. We question whether such tasks require an extensive thinking process during fine-tuning, and propose that it may actually hurt performance. Based on this premise, we introduce the No-Thinking-CLS-RL method, which minimizes the thinking process during training by setting an equality accuracy reward. Our findings indicate that, with much less fine-tuning time, No-Thinking-CLS-RL achieves better in-domain performance and generalization than CLS-RL.
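To make the contrast between the two reward designs concrete, the following is a minimal Python sketch of what a rule-based, verifiable reward for classification could look like. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the output template or the matching rule, so the `<think>`/`<answer>` tag format, the `normalize` helper, and both function names are hypothetical.

```python
import re

# Hypothetical output template: the abstract does not specify one.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def thinking_accuracy_reward(completion: str, label: str) -> float:
    """Assumed CLS-RL-style reward: 1.0 if the text inside <answer> tags matches the label.

    The model is free to emit reasoning before the answer tags.
    """
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 1.0 if normalize(match.group(1)) == normalize(label) else 0.0


def equality_accuracy_reward(completion: str, label: str) -> float:
    """Assumed No-Thinking-style reward: 1.0 only if the whole completion equals the label.

    Any extra reasoning text makes the comparison fail, which discourages thinking.
    """
    return 1.0 if normalize(completion) == normalize(label) else 0.0
```

Under these assumptions, a completion that reasons before answering can earn the first reward but never the second, which is one way an equality check could push the model toward emitting only the class name.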

