Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
November 10, 2023
Authors: Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
cs.AI
Abstract
We introduce Florence-2, a novel vision foundation model with a unified,
prompt-based representation for a variety of computer vision and
vision-language tasks. While existing large vision models excel in transfer
learning, they struggle to perform a diversity of tasks with simple
instructions, a capability that implies handling the complexity of various
spatial hierarchies and semantic granularities. Florence-2 was designed to take
text prompts as task instructions and generate desirable results in text form,
whether it be captioning, object detection, grounding or segmentation. This
multi-task learning setup demands large-scale, high-quality annotated data. To
this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive
visual annotations on 126 million images, using an iterative strategy of
automated image annotation and model refinement. We adopted a
sequence-to-sequence structure to train Florence-2 to perform versatile and
comprehensive vision tasks. Extensive evaluations on numerous tasks
demonstrated Florence-2 to be a strong vision foundation model contender with
unprecedented zero-shot and fine-tuning capabilities.
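
To make the prompt-as-task interface concrete, here is a minimal usage sketch. It assumes the publicly released Hugging Face checkpoint `microsoft/Florence-2-large` with its bundled processor and task tokens such as `<OD>` and `<CAPTION>`; none of these names appear in the abstract itself, so treat them as illustrative assumptions rather than the paper's own API.

```python
# Minimal sketch of Florence-2's prompt-based, sequence-to-sequence interface.
# Assumes the released Hugging Face checkpoint "microsoft/Florence-2-large"
# and its custom processor (trust_remote_code); not specified in the abstract.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any RGB image works; this URL is just an example.
url = ("https://huggingface.co/datasets/huggingface/documentation-images/"
       "resolve/main/transformers/tasks/car.jpg")
image = Image.open(requests.get(url, stream=True).raw)

# A task token in the text prompt selects the task: "<OD>" requests object
# detection; tokens like "<CAPTION>" or "<CAPTION_TO_PHRASE_GROUNDING>"
# (assumed names) select captioning or grounding instead.
task = "<OD>"
inputs = processor(text=task, images=image, return_tensors="pt")

# Every task is answered as a generated token sequence (sequence-to-sequence).
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# The processor parses the textual output back into structured predictions,
# e.g. bounding boxes and labels for "<OD>".
result = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(result)  # e.g. {"<OD>": {"bboxes": [...], "labels": [...]}}
```

The point of the sketch is the design choice highlighted in the abstract: a single text-in, text-out interface covers captioning, detection, grounding, and segmentation, with the processor translating between text tokens and task-specific structures such as boxes or masks.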