

Recognize Anything: A Strong Image Tagging Model

June 6, 2023
Authors: Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, Lei Zhang
cs.AI

Abstract

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM can recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. First, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and the parsed tags, respectively. Third, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned on a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised methods and exhibits performance competitive with the Google tagging API. We are releasing RAM at https://recognize-anything.github.io/ to foster the advancement of large models in computer vision.
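The first step above, obtaining annotation-free tags by parsing the text side of image-text pairs, can be sketched as follows. This is an illustrative simplification, not RAM's actual parser: the paper uses full text semantic parsing, whereas this sketch approximates it with a simple match of caption words (and their naive singular forms) against a hypothetical tag vocabulary.

```python
import re

# Hypothetical tag vocabulary; in RAM this is a large set of common
# categories distilled from the training corpus, not a hand-picked list.
TAG_VOCABULARY = {"dog", "frisbee", "grass", "park", "person", "beach"}


def parse_tags(caption: str) -> set[str]:
    """Extract vocabulary tags mentioned in a caption.

    Lowercases the caption, splits it into alphabetic tokens, adds a
    naive singular form for plural tokens, and intersects the result
    with the tag vocabulary. A real semantic parser would also handle
    synonyms, multi-word phrases, and relations.
    """
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    # Naive plural handling: "dogs" -> "dog".
    tokens |= {t[:-1] for t in tokens if t.endswith("s")}
    return tokens & TAG_VOCABULARY


print(parse_tags("Two dogs chase a frisbee on the grass in a park"))
```

Tags parsed this way supervise the tagging head, while the original captions supervise the captioning head, giving the unified preliminary model described in the second step.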