Specificity-aware reinforcement learning for fine-grained open-world classification

Abstract

Classifying fine-grained visual concepts in open-world settings, i.e., without a predefined label set, requires models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that these models do possess intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification in the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.
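The abstract does not spell out the reward, so the following is only a minimal sketch of one possible reading of "a reward anchored to the best predictions within online rollouts". It assumes a verifier has already mapped each rollout's prediction to an ordinal verdict. The three-level scale (`WRONG`, `GENERIC`, `SPECIFIC`), the function name `dynamic_rewards`, and the binary reward values are hypothetical choices made for illustration. They are not taken from the paper or from the SpeciaRL repository.

```python
from typing import List

# Hypothetical ordinal verdicts a verifier might assign to each prediction,
# ordered from worst to best. This scale is an assumption for illustration.
WRONG, GENERIC, SPECIFIC = 0, 1, 2


def dynamic_rewards(verdicts: List[int]) -> List[float]:
    """Reward each rollout relative to the best verdict in its own group."""
    if not verdicts:
        return []
    best = max(verdicts)
    if best == WRONG:
        # No rollout was correct, so nothing is rewarded.
        return [0.0] * len(verdicts)
    # A rollout is rewarded only if it reaches the best correct level that
    # the model itself achieved for this image.
    return [1.0 if v == best else 0.0 for v in verdicts]


# The model can be specific for this image: only the specific answer is rewarded.
print(dynamic_rewards([GENERIC, SPECIFIC, WRONG, GENERIC]))
# The model cannot be specific here: correct generic answers are still rewarded.
print(dynamic_rewards([GENERIC, WRONG, GENERIC]))
```

Under these assumptions, the target moves with the group of rollouts. Specificity is demanded only where the model has shown it can reach it, and a correct but generic answer is not penalized when no rollout did better.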
