Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Abstract

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel at transfer learning, they struggle to perform a diverse set of tasks from simple instructions, a capability that requires handling varying spatial hierarchies and semantic granularities. Florence-2 is designed to take a text prompt as the task instruction and to generate the desired result in text form, whether the task is captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, built with an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks show Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
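To make the idea of "results in text form" concrete, the following is a minimal sketch of how an object detection target could be serialized as a text sequence paired with a task prompt, so that a sequence-to-sequence model can be trained on it like any other text task. The abstract does not specify the encoding: the `<loc_N>` token format, the bin count `NUM_BINS`, the prompt wording, and the helper names `quantize`, `box_to_text`, and `build_example` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

# Assumed number of coordinate bins; the abstract does not state one.
NUM_BINS = 1000


def quantize(value: float, size: int, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    bin_index = int(value * num_bins / size)
    # Clamp so out-of-range or boundary coordinates stay in the vocabulary.
    return min(max(bin_index, 0), num_bins - 1)


def box_to_text(label: str, box: Box, width: int, height: int) -> str:
    """Serialize one labelled box as the label followed by four location tokens."""
    x0, y0, x1, y1 = box
    bins = (
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    )
    # Hypothetical token format, used here only for illustration.
    return label + "".join(f"<loc_{b}>" for b in bins)


def build_example(
    task_prompt: str,
    objects: List[Tuple[str, Box]],
    width: int,
    height: int,
) -> Tuple[str, str]:
    """Return a (prompt, target) text pair for sequence-to-sequence training."""
    target = "".join(box_to_text(label, box, width, height) for label, box in objects)
    return task_prompt, target


prompt, target = build_example(
    "Locate the objects in the image.",  # hypothetical prompt wording
    [("dog", (64.0, 120.0, 320.0, 400.0)), ("ball", (400.0, 360.0, 460.0, 420.0))],
    width=640,
    height=480,
)
print(prompt)
print(target)
```

Under this kind of encoding, captioning, detection, and grounding would differ only in the prompt and in the vocabulary of the target string, which is what allows a single model and a single training objective to cover all of them.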
