Inject Semantic Concepts into Image Tagging for Open-Set Recognition
October 23, 2023
Authors: Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, Lei Zhang
cs.AI
Abstract
In this paper, we introduce the Recognize Anything Plus Model (RAM++), a fundamental image recognition model with strong open-set recognition capabilities, achieved by injecting semantic concepts into the image tagging training framework. Previous approaches are either image tagging models constrained by limited semantics, or vision-language models whose shallow interaction yields suboptimal performance in multi-tag recognition. In contrast, RAM++ integrates image-text alignment and image tagging within a unified fine-grained interaction framework based on image-tags-text triplets. This design enables RAM++ not only to excel at identifying predefined categories, but also to significantly augment its recognition ability for open-set categories. Moreover, RAM++ employs large language models (LLMs) to generate diverse visual tag descriptions, pioneering the integration of LLM knowledge into image tagging training. This approach empowers RAM++ to integrate visual description concepts for open-set recognition during inference. Evaluations on comprehensive image recognition benchmarks demonstrate that RAM++ exceeds existing state-of-the-art (SOTA) fundamental image recognition models in most aspects. Specifically, for predefined commonly used tag categories, RAM++ shows enhancements of 10.2 mAP and 15.4 mAP over CLIP on OpenImages and ImageNet, respectively. For open-set categories beyond the predefined set, RAM++ records improvements of 5 mAP and 6.4 mAP over CLIP and RAM, respectively, on OpenImages. For diverse human-object interaction phrases, RAM++ achieves improvements of 7.8 mAP and 4.7 mAP on the HICO benchmark. Code, datasets, and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.
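
As a loose illustration of the LLM step described in the abstract, the sketch below builds prompts that ask a language model for diverse visual descriptions of a tag. The function name and prompt wording are hypothetical stand-ins; the paper's actual prompts and choice of LLM are not reproduced here.

```python
# Hypothetical prompt construction for LLM-generated visual tag
# descriptions; the paper's exact prompts may differ.
def build_description_prompts(tag: str) -> list[str]:
    """Return prompts asking an LLM how a tag typically looks."""
    templates = [
        "Describe the visual appearance of a {tag} in one short sentence.",
        "What shape, color, and texture does a {tag} usually have?",
        "Describe a typical scene or context in which a {tag} appears.",
    ]
    return [t.format(tag=tag) for t in templates]

print(build_description_prompts("corgi"))
```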
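
To make the inference-time use of such descriptions concrete, here is a minimal sketch that approximates description-based open-set tagging with an off-the-shelf CLIP model from Hugging Face `transformers`: each candidate tag is represented by several hand-written descriptions standing in for LLM output, and the image is scored against them. This is an assumption-laden stand-in, not the RAM++ architecture itself, whose fine-grained alignment and fusion differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hand-written stand-ins for LLM-generated visual descriptions per tag.
tag_descriptions = {
    "corgi": [
        "a small dog with short legs, a long body, and upright ears",
        "a tan-and-white dog with a fox-like face",
    ],
    "espresso": [
        "a small cup of strong dark coffee topped with light-brown crema",
        "coffee served in a tiny ceramic cup on a saucer",
    ],
}

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image

with torch.no_grad():
    # Embed the image once and L2-normalize.
    img_inputs = processor(images=image, return_tensors="pt")
    img_feat = model.get_image_features(**img_inputs)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    scores = {}
    for tag, descs in tag_descriptions.items():
        # Embed each description of this tag and L2-normalize.
        txt_inputs = processor(text=descs, return_tensors="pt", padding=True)
        txt_feat = model.get_text_features(**txt_inputs)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity of the image to each description; take the
        # best-matching description as this tag's score (a simple stand-in
        # for RAM++'s learned aggregation).
        sim = (img_feat @ txt_feat.T).squeeze(0)
        scores[tag] = sim.max().item()

print(scores)  # higher score -> tag more likely present in the image
```

In a multi-tag setting, one would threshold these per-tag scores rather than take a softmax over tags, since several tags can hold for a single image.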