Recognize Anything: A Strong Image Tagging Model

Abstract

We present the Recognize Anything Model (RAM), a strong foundation model for image tagging. RAM can recognize any common category with high accuracy. It introduces a new paradigm for image tagging, training on large-scale image-text pairs instead of manual annotations. The development of RAM comprises four key steps. First, annotation-free image tags are obtained at scale through automatic text semantic parsing. Second, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and the parsed tags, respectively. Third, a data engine is employed to generate additional annotations and clean incorrect ones. Finally, the model is retrained with the processed data and fine-tuned on a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised approaches and is competitive with the Google API. We release RAM at https://recognize-anything.github.io/ to foster the advancement of large models in computer vision.
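As a rough illustration of the first step (obtaining tags from captions without manual annotation), the sketch below matches caption words against a small tag vocabulary. It is not RAM's actual parser: the abstract does not specify the semantic parsing method, so plain word matching stands in for it, and the names `TAG_VOCABULARY` and `parse_tags`, the vocabulary contents, and the example captions are all hypothetical.

```python
import re

# Hypothetical tag vocabulary; the abstract does not list the real one.
TAG_VOCABULARY = frozenset({"dog", "frisbee", "grass", "park"})


def parse_tags(caption, vocabulary=TAG_VOCABULARY):
    """Return the vocabulary tags mentioned in a caption, sorted alphabetically.

    Simple word matching stands in for the automatic text semantic
    parsing named in the abstract (an assumption made for illustration).
    """
    words = re.findall(r"[a-z]+", caption.lower())
    # Naive plural handling: also try each word with a trailing "s" removed.
    candidates = set(words) | {w[:-1] for w in words if w.endswith("s")}
    return sorted(candidates & vocabulary)


# Hypothetical image-text pairs; tags come from the text alone.
pairs = [
    ("img_001.jpg", "A dog catches a frisbee on the grass."),
    ("img_002.jpg", "Two dogs in a park"),
]
annotations = {image: parse_tags(caption) for image, caption in pairs}
```

Tags derived this way are noisy and incomplete, since a caption rarely mentions everything visible in the image, which is presumably why the pipeline follows up with a data engine that adds missing annotations and cleans incorrect ones.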
