Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
November 10, 2023
Authors: Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
cs.AI
Abstract
We introduce Florence-2, a novel vision foundation model with a unified,
prompt-based representation for a variety of computer vision and
vision-language tasks. While existing large vision models excel in transfer
learning, they struggle to perform a diverse range of tasks from simple
instructions, a capability that requires handling varied spatial hierarchies
and levels of semantic granularity. Florence-2 was designed to take text
prompts as task instructions and generate the desired results in text form,
whether the task is captioning, object detection, grounding, or segmentation. This
multi-task learning setup demands large-scale, high-quality annotated data. To
this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive
visual annotations on 126 million images, using an iterative strategy of
automated image annotation and model refinement. We adopted a
sequence-to-sequence structure to train Florence-2 to perform versatile and
comprehensive vision tasks. Extensive evaluations on numerous tasks
demonstrated Florence-2 to be a strong vision foundation model contender with
unprecedented zero-shot and fine-tuning capabilities.
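
To make the "text prompt in, text result out" idea concrete, here is a minimal
sketch of how a spatial task such as object detection could be expressed as a
sequence-to-sequence training pair. Everything in it is an illustrative
assumption rather than something stated in the abstract: the `<loc_N>` token
format, the 1000-bin quantization, the prompt string, and the helper names
`quantize`, `boxes_to_text`, and `build_example` are all hypothetical.

```python
from dataclasses import dataclass

# Assumed quantization resolution; the abstract does not specify one.
NUM_BINS = 1000


@dataclass
class Box:
    """A labeled bounding box in pixel coordinates."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def quantize(value: float, size: float, bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, int(value / size * bins)))


def boxes_to_text(boxes, width: float, height: float) -> str:
    """Serialize boxes as plain text: each label followed by four location tokens."""
    parts = []
    for b in boxes:
        coords = [
            quantize(b.x0, width),
            quantize(b.y0, height),
            quantize(b.x1, width),
            quantize(b.y1, height),
        ]
        parts.append(b.label + "".join(f"<loc_{c}>" for c in coords))
    return "".join(parts)


def build_example(task_prompt: str, target_text: str) -> dict:
    """Pair a task instruction with its text-form target for seq2seq training."""
    return {"input_text": task_prompt, "target_text": target_text}


# Hypothetical usage: one detection annotation on a 640x480 image.
target = boxes_to_text([Box("cat", 0, 0, 320, 240)], width=640, height=480)
example = build_example("Locate the objects in the image.", target)
```

Because the target is ordinary text, the same encoder-decoder and the same
loss can in principle serve captioning (free-form text), detection (labels
plus location tokens), and grounding (phrases plus location tokens), with only
the prompt changing between tasks.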