
Inject Semantic Concepts into Image Tagging for Open-Set Recognition

October 23, 2023
作者: Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, Lei Zhang
cs.AI

Abstract

In this paper, we introduce the Recognize Anything Plus Model (RAM++), a fundamental image recognition model with strong open-set recognition capabilities, achieved by injecting semantic concepts into the image tagging training framework. Previous approaches are either image tagging models constrained by limited semantics, or vision-language models whose shallow interaction yields suboptimal performance in multi-tag recognition. In contrast, RAM++ integrates image-text alignment and image tagging within a unified fine-grained interaction framework based on image-tags-text triplets. This design enables RAM++ not only to excel at identifying predefined categories, but also to significantly augment recognition ability for open-set categories. Moreover, RAM++ employs large language models (LLMs) to generate diverse visual tag descriptions, pioneering the integration of LLM knowledge into image tagging training. This approach empowers RAM++ to integrate visual description concepts for open-set recognition during inference. Evaluations on comprehensive image recognition benchmarks demonstrate that RAM++ exceeds existing state-of-the-art (SOTA) fundamental image recognition models on most aspects. Specifically, for predefined commonly used tag categories, RAM++ showcases enhancements of 10.2 mAP and 15.4 mAP over CLIP on OpenImages and ImageNet, respectively. For open-set categories beyond the predefined set, RAM++ records improvements of 5 mAP and 6.4 mAP over CLIP and RAM, respectively, on OpenImages. For diverse human-object interaction phrases, RAM++ achieves improvements of 7.8 mAP and 4.7 mAP on the HICO benchmark. Code, datasets, and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.
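To make the "unified fine-grained interaction framework based on image-tags-text triplets" concrete, below is a minimal sketch of what a combined training objective over such triplets could look like: a multi-label tagging loss from image-tag similarities plus a symmetric contrastive alignment loss between images and captions. This is an illustrative simplification under our own assumptions (simple dot-product interaction, a fixed temperature of 0.07, equal loss weighting), not the paper's actual architecture, which uses a richer fine-grained interaction.

```python
# Illustrative sketch of a unified objective over image-tags-text triplets.
# NOT the RAM++ implementation: interaction here is plain dot-product
# similarity, and the loss weighting/temperature are assumed values.
import torch
import torch.nn.functional as F

def unified_triplet_loss(image_feats: torch.Tensor,
                         tag_text_feats: torch.Tensor,
                         tag_labels: torch.Tensor,
                         caption_feats: torch.Tensor) -> torch.Tensor:
    """
    image_feats:    (B, D) L2-normalized image embeddings
    tag_text_feats: (T, D) L2-normalized embeddings of the tag vocabulary
    tag_labels:     (B, T) multi-hot ground-truth tags per image
    caption_feats:  (B, D) L2-normalized caption/text embeddings
    """
    # Tagging branch: per-tag logits from image-tag similarity,
    # supervised with a multi-label binary cross-entropy loss.
    tag_logits = image_feats @ tag_text_feats.T                # (B, T)
    tagging_loss = F.binary_cross_entropy_with_logits(tag_logits, tag_labels)

    # Alignment branch: symmetric InfoNCE between images and captions,
    # with matching pairs on the diagonal of the similarity matrix.
    logits = image_feats @ caption_feats.T / 0.07              # (B, B)
    targets = torch.arange(len(image_feats), device=logits.device)
    align_loss = (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets)) / 2

    return tagging_loss + align_loss
```

Treating tags and free-form text through the same text encoder is what lets a single framework cover both the predefined vocabulary and open-set phrases.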
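The inference-time idea of integrating LLM-generated visual descriptions can likewise be sketched in a few lines. The snippet below scores an image against each open-set tag by averaging the embeddings of several descriptions of that tag. It uses CLIP from Hugging Face transformers as a stand-in encoder, with hard-coded descriptions in place of real LLM output; the tags, descriptions, and threshold are hypothetical illustrations, not values from the paper.

```python
# Sketch of open-set tagging via aggregated tag descriptions, using CLIP
# as a stand-in encoder. Descriptions below are hard-coded placeholders
# for LLM-generated text; threshold is an assumed value.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical LLM-generated visual descriptions for each open-set tag.
tag_descriptions = {
    "corgi": [
        "a small dog with short legs and a long body",
        "a dog with large upright ears and a fox-like face",
    ],
    "espresso": [
        "a small cup of strong dark coffee topped with light brown crema",
        "a concentrated coffee drink served in a demitasse cup",
    ],
}

@torch.no_grad()
def tag_image(image: Image.Image, threshold: float = 0.2) -> list[str]:
    """Score each tag by averaging the embeddings of its descriptions."""
    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    predicted = []
    for tag, descriptions in tag_descriptions.items():
        text_inputs = processor(text=descriptions, return_tensors="pt",
                                padding=True)
        text_emb = model.get_text_features(**text_inputs)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        # Aggregate the descriptions into a single tag embedding.
        tag_emb = text_emb.mean(dim=0, keepdim=True)
        tag_emb = tag_emb / tag_emb.norm(dim=-1, keepdim=True)
        if (image_emb @ tag_emb.T).item() > threshold:
            predicted.append(tag)
    return predicted

# Usage: tag_image(Image.open("photo.jpg"))
```

Because the descriptions carry concrete visual evidence (shape, color, context) rather than a bare class name, a tag never seen during training can still be matched against what the image actually shows.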