Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Abstract

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel at transfer learning, they struggle to perform a diverse range of tasks from simple instructions, a capability that requires handling varying spatial hierarchies and semantic granularities. Florence-2 is designed to take text prompts as task instructions and to generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrate that Florence-2 is a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
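To make the idea of "results in text form" concrete, the following is a minimal sketch of how an object-detection annotation could be serialized into a target string for a sequence-to-sequence model. The abstract does not specify the encoding, so everything here is an assumption for illustration: the `<loc_N>` token format, the 1000-bin quantization, the prompt wording, and the helper names `quantize`, `box_to_text`, and `build_example` are all hypothetical.

```python
from typing import List, Tuple

# Assumed quantization resolution; the abstract does not state one.
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def quantize(value: float, size: int, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    return min(max(bin_index, 0), num_bins - 1)


def box_to_text(label: str, box: Box, width: int, height: int) -> str:
    """Serialize one labeled box as text: the label followed by four location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    # Hypothetical token format, chosen only for illustration.
    return label + "".join(f"<loc_{b}>" for b in bins)


def build_example(
    task_prompt: str,
    detections: List[Tuple[str, Box]],
    width: int,
    height: int,
) -> Tuple[str, str]:
    """Return a (prompt, target) text pair for one image."""
    target = "".join(box_to_text(lbl, box, width, height) for lbl, box in detections)
    return task_prompt, target


if __name__ == "__main__":
    prompt, target = build_example(
        "Locate the objects in the image.",  # hypothetical prompt wording
        [("cat", (0, 0, 320, 240)), ("dog", (320, 240, 640, 480))],
        width=640,
        height=480,
    )
    print(prompt)
    print(target)
```

Under this kind of scheme, changing the task only changes the prompt and the target string, so a single text decoder can serve captioning, detection, grounding, and segmentation without task-specific output heads.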
