Inject Semantic Concepts into Image Tagging for Open-Set Recognition

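Abstract

This paper introduces the Recognize Anything Plus Model (RAM++), a fundamental image recognition model that gains strong open-set recognition capabilities by injecting semantic concepts into the image tagging training framework. Previous approaches were either image tagging models constrained by limited semantics or vision-language models whose shallow interaction is suboptimal for multi-tag recognition. In contrast, RAM++ integrates image-text alignment and image tagging within a unified fine-grained interaction framework based on image-tag-text triplets. With this design, RAM++ not only excels at identifying predefined categories but also significantly improves recognition of open-set categories. In addition, RAM++ uses large language models (LLMs) to generate diverse visual tag descriptions, a pioneering attempt to integrate LLM knowledge into image tagging training. This allows RAM++ to incorporate visual description concepts at inference time for open-set recognition. Evaluations on comprehensive image recognition benchmarks show that RAM++ outperforms existing state-of-the-art (SOTA) fundamental image recognition models in most aspects. Specifically, on predefined, commonly used tag categories, RAM++ improves on CLIP by 10.2 mAP on OpenImages and 15.4 mAP on ImageNet. On open-set categories beyond the predefined set, RAM++ improves on CLIP and RAM by 5 mAP and 6.4 mAP, respectively, on OpenImages. On diverse human-object interaction phrases, RAM++ achieves improvements of 7.8 mAP and 4.7 mAP on the HICO benchmark. Code, datasets, and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.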

English

In this paper, we introduce the Recognize Anything Plus Model (RAM++), a fundamental image recognition model with strong open-set recognition capabilities, built by injecting semantic concepts into the image tagging training framework. Previous approaches are either image tagging models constrained by limited semantics or vision-language models whose shallow interaction yields suboptimal performance in multi-tag recognition. In contrast, RAM++ integrates image-text alignment and image tagging within a unified fine-grained interaction framework based on image-tag-text triplets. This design enables RAM++ not only to excel at identifying predefined categories but also to significantly improve recognition of open-set categories. Moreover, RAM++ employs large language models (LLMs) to generate diverse visual tag descriptions, pioneering the integration of LLM knowledge into image tagging training. This approach allows RAM++ to integrate visual description concepts for open-set recognition during inference. Evaluations on comprehensive image recognition benchmarks demonstrate that RAM++ outperforms existing state-of-the-art (SOTA) fundamental image recognition models in most aspects. Specifically, for predefined, commonly used tag categories, RAM++ improves on CLIP by 10.2 mAP and 15.4 mAP on OpenImages and ImageNet, respectively. For open-set categories beyond the predefined set, RAM++ records improvements of 5 mAP and 6.4 mAP over CLIP and RAM, respectively, on OpenImages. For diverse human-object interaction phrases, RAM++ achieves improvements of 7.8 mAP and 4.7 mAP on the HICO benchmark. Code, datasets, and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.
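
The gains above are reported in mean average precision (mAP). As a reading aid, the following is a minimal sketch of how mAP is commonly computed for multi-label tagging: compute average precision (AP) per tag over images ranked by score, then average across tags. The function names, the toy scores, and the convention of assigning 0.0 to a tag with no positive image are assumptions made for illustration; the text above does not specify the evaluation protocol used for each benchmark.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one tag: mean of precision@k over the ranks k that hold a positive image."""
    # Rank images by descending confidence for this tag.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # AP is undefined for a tag with no positives; returning 0.0 is an assumption.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    score_matrix: Sequence[Sequence[float]],
    label_matrix: Sequence[Sequence[int]],
) -> float:
    """Mean of per-tag AP. Both matrices are indexed as [image][tag]."""
    num_tags = len(score_matrix[0])
    per_tag_ap = []
    for tag in range(num_tags):
        tag_scores = [row[tag] for row in score_matrix]
        tag_labels = [row[tag] for row in label_matrix]
        per_tag_ap.append(average_precision(tag_scores, tag_labels))
    return sum(per_tag_ap) / num_tags


# Toy data (hypothetical): 4 images, 2 tags.
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]

print(round(mean_average_precision(toy_scores, toy_labels), 4))
```

On the toy data the per-tag APs are 5/6 and 1.0, so the mAP is 11/12, about 0.917. Papers usually report this value multiplied by 100, which appears to be the scale of the figures quoted above.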
