
Recognize Anything: A Strong Image Tagging Model

June 6, 2023
Authors: Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, Lei Zhang
cs.AI

Abstract

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM can recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. First, annotation-free image tags are obtained at scale through automatic text semantic parsing. Second, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and the parsed tags, respectively. Third, a data engine is employed to generate additional annotations and clean incorrect ones. Finally, the model is retrained on the processed data and fine-tuned on a smaller but higher-quality dataset. We evaluate the tagging capability of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised models and performs competitively with the Google API. We are releasing RAM at https://recognize-anything.github.io/ to foster the advancement of large models in computer vision.
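The annotation-free first step can be made concrete with a toy example. The sketch below is not the authors' parser: it assumes spaCy's noun-chunk extraction as a stand-in for the automatic text semantic parser described in the abstract, and a small hypothetical `TAG_VOCABULARY` in place of RAM's label system; only the idea of mining tags from captions comes from the paper.

```python
# Minimal sketch of mining annotation-free tags from an image's caption.
# spaCy noun chunks stand in for the paper's semantic parser, and
# TAG_VOCABULARY is a hypothetical stand-in for RAM's label system.
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hypothetical, tiny vocabulary; RAM's real label system is far larger.
TAG_VOCABULARY = {"dog", "frisbee", "grass", "park", "person"}

def parse_tags(caption: str) -> set[str]:
    """Extract candidate tags from a caption via shallow semantic parsing."""
    doc = nlp(caption.lower())
    # The head noun of each noun chunk becomes a candidate tag;
    # keep only candidates that fall inside the tag vocabulary.
    candidates = {chunk.root.lemma_ for chunk in doc.noun_chunks}
    return candidates & TAG_VOCABULARY

print(parse_tags("A dog catches a frisbee on the grass in a park."))
# -> {'dog', 'frisbee', 'grass', 'park'}
```

Applied over millions of image-text pairs, a parser along these lines turns free-form captions into multi-label tag supervision without any human annotation, which is what allows the later steps to train, clean, and refine the model at scale.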