Recognize Anything: A Strong Image Tagging Model

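Abstract

We present the Recognize Anything Model (RAM), a strong foundation model for image tagging. RAM can recognize any common category with high accuracy. It introduces a new paradigm for image tagging that trains on large-scale image-text pairs instead of manual annotations. The development of RAM comprises four key steps. First, annotation-free image tags are obtained at scale through automatic text semantic parsing. Second, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and the parsed tags, respectively. Third, a data engine is employed to generate additional annotations and clean incorrect ones. Finally, the model is retrained with the processed data and fine-tuned on a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised methods and is competitive with the Google API. We release RAM at https://recognize-anything.github.io/ to foster the advancement of large models in computer vision.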

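The first of the four steps, obtaining tags from captions without manual annotation, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says that tags come from automatic text semantic parsing, so plain word matching is used here as a stand-in. The vocabulary `TAG_VOCAB` and the helpers `parse_tags` and `to_multi_hot` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical tag vocabulary: canonical tag -> surface forms that map to it.
# A real system would use a far larger vocabulary and a proper semantic parser.
TAG_VOCAB = {
    "dog": {"dog", "dogs", "puppy"},
    "grass": {"grass", "lawn"},
    "ball": {"ball", "balls"},
    "run": {"run", "runs", "running"},
}
TAG_ORDER = sorted(TAG_VOCAB)  # fixed order for the multi-hot label vector


def parse_tags(caption: str) -> list[str]:
    """Return the canonical tags whose surface forms appear in the caption."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return [tag for tag in TAG_ORDER if words & TAG_VOCAB[tag]]


def to_multi_hot(tags: list[str]) -> list[int]:
    """Encode a tag list as a 0/1 vector aligned with TAG_ORDER."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in TAG_ORDER]


caption = "A puppy is running across the lawn"
tags = parse_tags(caption)
labels = to_multi_hot(tags)
print(tags)    # ['dog', 'grass', 'run']
print(labels)  # [0, 1, 1, 1]
```

Under these assumptions, each image-text pair yields both a caption and a multi-hot tag vector, which is the kind of paired supervision the second step relies on when it unifies captioning and tagging.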
