Inject Semantic Concepts into Image Tagging for Open-Set Recognition
October 23, 2023
Authors: Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, Lei Zhang
cs.AI
Abstract
In this paper, we introduce the Recognize Anything Plus Model (RAM++), a fundamental image recognition model with strong open-set recognition capabilities, achieved by injecting semantic concepts into the image tagging training framework. Previous approaches are either image tagging models constrained by limited semantics, or vision-language models whose shallow interaction yields suboptimal performance in multi-tag recognition. In contrast, RAM++ integrates image-text alignment and image tagging within a unified fine-grained interaction framework based on image-tags-text triplets. This design enables RAM++ not only to excel at identifying predefined categories, but also to significantly augment its recognition ability for open-set categories. Moreover, RAM++ employs large language models (LLMs) to generate diverse visual tag descriptions, pioneering the integration of LLM knowledge into image tagging training. This approach empowers RAM++ to integrate visual description concepts for open-set recognition during inference. Evaluations on comprehensive image recognition benchmarks demonstrate that RAM++ exceeds existing state-of-the-art (SOTA) fundamental image recognition models in most aspects. Specifically, for predefined commonly used tag categories, RAM++ shows enhancements of 10.2 mAP and 15.4 mAP over CLIP on OpenImages and ImageNet, respectively. For open-set categories beyond the predefined set, RAM++ records improvements of 5 mAP and 6.4 mAP over CLIP and RAM, respectively, on OpenImages. For diverse human-object interaction phrases, RAM++ achieves improvements of 7.8 mAP and 4.7 mAP on the HICO benchmark. Code, datasets, and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.
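
As a loose illustration of the LLM step described in the abstract, the sketch below builds prompts that ask a language model for diverse visual descriptions of a tag. The function name and prompt wording are hypothetical stand-ins; the paper's actual prompts and choice of LLM are not reproduced here.

```python
# Hypothetical prompt construction for LLM-generated visual tag
# descriptions; the paper's exact prompts may differ.
def build_description_prompts(tag: str) -> list[str]:
    """Return prompts asking an LLM how a tag typically looks."""
    templates = [
        "Describe the visual appearance of a {tag} in one short sentence.",
        "What shape, color, and texture does a {tag} usually have?",
        "Describe a typical scene or context in which a {tag} appears.",
    ]
    return [t.format(tag=tag) for t in templates]

print(build_description_prompts("corgi"))
```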
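
To make the inference-time use of such descriptions concrete, here is a minimal sketch that approximates description-based open-set tagging with an off-the-shelf CLIP model from Hugging Face `transformers`: each candidate tag is represented by several hand-written descriptions standing in for LLM output, and the image is scored against them. This is an assumption-laden stand-in, not the RAM++ architecture itself, whose fine-grained alignment and fusion differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hand-written stand-ins for LLM-generated visual descriptions per tag.
tag_descriptions = {
    "corgi": [
        "a small dog with short legs, a long body, and upright ears",
        "a tan-and-white dog with a fox-like face",
    ],
    "espresso": [
        "a small cup of strong dark coffee topped with light-brown crema",
        "coffee served in a tiny ceramic cup on a saucer",
    ],
}

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image

with torch.no_grad():
    # Embed the image once and L2-normalize.
    img_inputs = processor(images=image, return_tensors="pt")
    img_feat = model.get_image_features(**img_inputs)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    scores = {}
    for tag, descs in tag_descriptions.items():
        # Embed each description of this tag and L2-normalize.
        txt_inputs = processor(text=descs, return_tensors="pt", padding=True)
        txt_feat = model.get_text_features(**txt_inputs)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity of the image to each description; take the
        # best-matching description as this tag's score (a simple stand-in
        # for RAM++'s learned aggregation).
        sim = (img_feat @ txt_feat.T).squeeze(0)
        scores[tag] = sim.max().item()

print(scores)  # higher score -> tag more likely present in the image
```

In a multi-tag setting, one would threshold these per-tag scores rather than take a softmax over tags, since several tags can hold for a single image.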