Inject Semantic Concepts into Image Tagging for Open-Set Recognition

Abstract

In this paper, we introduce the Recognize Anything Plus Model (RAM++), a fundamental image recognition model with strong open-set recognition capabilities, built by injecting semantic concepts into the image tagging training framework. Previous approaches are either image tagging models constrained by limited semantics, or vision-language models whose shallow interaction yields suboptimal performance in multi-tag recognition. In contrast, RAM++ integrates image-text alignment and image tagging within a unified fine-grained interaction framework based on image-tags-text triplets. This design enables RAM++ not only to excel at identifying predefined categories, but also to significantly improve recognition of open-set categories. Moreover, RAM++ employs large language models (LLMs) to generate diverse visual tag descriptions, pioneering the integration of LLM knowledge into image tagging training. This approach allows RAM++ to integrate visual description concepts for open-set recognition during inference. Evaluations on comprehensive image recognition benchmarks demonstrate that RAM++ exceeds existing state-of-the-art (SOTA) fundamental image recognition models in most aspects. Specifically, for predefined commonly used tag categories, RAM++ achieves improvements of 10.2 mAP and 15.4 mAP over CLIP on OpenImages and ImageNet, respectively. For open-set categories beyond the predefined set, RAM++ records improvements of 5 mAP and 6.4 mAP over CLIP and RAM, respectively, on OpenImages. For diverse human-object interaction phrases, RAM++ achieves 7.8 mAP and 4.7 mAP improvements on the HICO benchmark. Code, datasets and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.
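The gains above are reported in mean average precision (mAP). As a reading aid, the following is a minimal sketch of how mAP is commonly computed for multi-label tagging: average precision is computed per tag over the ranked images, then averaged over tags. The function names, the toy data, and the convention for tags without positives are assumptions made for illustration; this is not the evaluation code from the linked repository, and the benchmarks may use a different AP variant.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one tag: mean of precision@k over the ranks k that hold a positive image."""
    # Rank images by descending confidence for this tag.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    # Assumed convention: a tag with no positive images contributes 0.0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    score_matrix: Sequence[Sequence[float]],
    label_matrix: Sequence[Sequence[int]],
) -> float:
    """score_matrix[i][t] and label_matrix[i][t] refer to image i and tag t."""
    num_tags = len(score_matrix[0])
    per_tag_ap = []
    for t in range(num_tags):
        tag_scores = [row[t] for row in score_matrix]
        tag_labels = [row[t] for row in label_matrix]
        per_tag_ap.append(average_precision(tag_scores, tag_labels))
    return sum(per_tag_ap) / num_tags


# Toy example (hypothetical data): 4 images, 2 tags.
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
print(round(mean_average_precision(scores, labels), 4))  # 0.9167
```

The sketch returns a value in [0, 1]. Figures such as "10.2 mAP" are differences on the 0 to 100 scale, so a gain of 10.2 corresponds to 0.102 here.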
