Recognize Anything: A Strong Image Tagging Model
June 6, 2023
Authors: Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, Lei Zhang
cs.AI
Abstract
We present the Recognize Anything Model (RAM): a strong foundation model for
image tagging. RAM can recognize any common category with high accuracy. RAM
introduces a new paradigm for image tagging, leveraging large-scale image-text
pairs for training instead of manual annotations. The development of RAM
comprises four key steps. Firstly, annotation-free image tags are obtained at
scale through automatic text semantic parsing. Subsequently, a preliminary
model is trained for automatic annotation by unifying the caption and tagging
tasks, supervised by the original texts and parsed tags, respectively. Thirdly,
a data engine is employed to generate additional annotations and clean
incorrect ones. Lastly, the model is retrained with the processed data and
fine-tuned using a smaller but higher-quality dataset. We evaluate the tagging
capabilities of RAM on numerous benchmarks and observe impressive zero-shot
performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even
surpasses fully supervised approaches and performs competitively with the
Google tagging API. We are releasing RAM at
https://recognize-anything.github.io/ to foster the advancement of large
models in computer vision.
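The first step above, obtaining annotation-free tags from image-text pairs, can be sketched in miniature. The abstract says tags are derived from captions by automatic text semantic parsing; the vocabulary-matching function below is a simplified, hypothetical stand-in for that parser (the tag vocabulary and function name are illustrative, not from the paper):

```python
# Hedged sketch of RAM's step 1: deriving annotation-free image tags from
# captions. The real system uses an automatic text semantic parser; this
# simple word-matching against a fixed tag vocabulary is an illustrative
# approximation only.

# Hypothetical tag vocabulary (RAM's actual label system is much larger).
TAG_VOCABULARY = {"dog", "frisbee", "grass", "beach", "sunset", "person"}

def parse_tags(caption: str) -> set:
    """Return vocabulary tags that appear as words in the caption."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    return TAG_VOCABULARY & words

# Each (image, caption) pair thus yields (image, tags) training supervision
# without any manual annotation.
print(sorted(parse_tags("A dog catches a frisbee on the grass.")))
# → ['dog', 'frisbee', 'grass']
```

In the full pipeline, these parsed tags supervise the tagging head while the original caption supervises the captioning head, which is how the two tasks are unified in the preliminary model.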