Recognize Anything: A Strong Image Tagging Model
June 6, 2023
Authors: Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, Lei Zhang
cs.AI
Abstract
We present the Recognize Anything Model (RAM): a strong foundation model for
image tagging. RAM can recognize any common category with high accuracy. RAM
introduces a new paradigm for image tagging, leveraging large-scale image-text
pairs for training instead of manual annotations. The development of RAM
comprises four key steps. Firstly, annotation-free image tags are obtained at
scale through automatic text semantic parsing. Subsequently, a preliminary
model is trained for automatic annotation by unifying the caption and tagging
tasks, supervised by the original texts and parsed tags, respectively. Thirdly,
a data engine is employed to generate additional annotations and clean
incorrect ones. Lastly, the model is retrained with the processed data and
fine-tuned using a smaller but higher-quality dataset. We evaluate the tagging
capabilities of RAM on numerous benchmarks and observe impressive zero-shot
performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even
surpasses fully supervised approaches and performs competitively with the
Google tagging API. We are releasing RAM at
https://recognize-anything.github.io/ to foster the advancement of large
models in computer vision.
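The first step above, obtaining annotation-free tags from image-text pairs, can be sketched in miniature. The abstract says tags are derived from captions by automatic text semantic parsing; the vocabulary-matching function below is a simplified, hypothetical stand-in for that parser (the tag vocabulary and function name are illustrative, not from the paper):

```python
# Hedged sketch of RAM's step 1: deriving annotation-free image tags from
# captions. The real system uses an automatic text semantic parser; this
# simple word-matching against a fixed tag vocabulary is an illustrative
# approximation only.

# Hypothetical tag vocabulary (RAM's actual label system is much larger).
TAG_VOCABULARY = {"dog", "frisbee", "grass", "beach", "sunset", "person"}

def parse_tags(caption: str) -> set:
    """Return vocabulary tags that appear as words in the caption."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    return TAG_VOCABULARY & words

# Each (image, caption) pair thus yields (image, tags) training supervision
# without any manual annotation.
print(sorted(parse_tags("A dog catches a frisbee on the grass.")))
# → ['dog', 'frisbee', 'grass']
```

In the full pipeline, these parsed tags supervise the tagging head while the original caption supervises the captioning head, which is how the two tasks are unified in the preliminary model.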