Specificity-aware reinforcement learning for fine-grained open-world classification

Abstract

Classifying fine-grained visual concepts in open-world settings, i.e., without a predefined label set, requires models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that these models do intrinsically possess fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose SpeciaRL, a novel specificity-aware reinforcement learning framework for fine-tuning reasoning LMMs on fine-grained image classification in the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities so as to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. The code and model are publicly available at https://github.com/s-angheben/SpeciaRL.
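To make the idea of a reward "anchored to the best predictions within online rollouts" more concrete, here is a toy sketch in Python. It is not taken from the SpeciaRL code. The three ordinal verifier labels, the function name `dynamic_rewards`, and the binary reward values are all assumptions made for illustration. The sketch only shows the general principle: the target a rollout must reach is set by the best level the model itself achieved for that image.

```python
from typing import List

# Hypothetical ordinal verifier labels (an assumption, not from the paper).
WRONG = 0      # prediction judged incorrect by the verifier
GENERIC = 1    # correct but coarse (e.g. "bird")
SPECIFIC = 2   # correct and fine-grained (e.g. a species name)


def dynamic_rewards(levels: List[int]) -> List[float]:
    """Assign a reward to each rollout of one image.

    The target is the best verifier level reached within the group,
    so the model is only pushed toward specificity it has shown it
    can reach. Wrong predictions never receive a reward.
    """
    best = max(levels)
    if best == WRONG:
        # No rollout was correct: nothing to anchor on, no reward.
        return [0.0] * len(levels)
    # Reward only rollouts that match the best level in the group.
    return [1.0 if level == best else 0.0 for level in levels]


# One image, four rollouts: only the specific rollout is rewarded.
print(dynamic_rewards([WRONG, GENERIC, SPECIFIC, GENERIC]))
# Another image where the model never gets more specific than
# a generic answer: the generic rollouts are rewarded instead.
print(dynamic_rewards([WRONG, GENERIC, GENERIC]))
```

Under these assumptions, an image for which the model can only give a generic answer still rewards that answer, which avoids pushing the model toward a specific but wrong guess.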
