General Object Foundation Model for Images and Videos at Scale

Abstract

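We introduce GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE performs detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios across a range of object perception tasks. GLEE adopts a cohesive learning strategy, acquiring knowledge from data sources with varying levels of supervision to form general object representations, and it excels at zero-shot transfer to new data and tasks. Specifically, GLEE uses an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, which lets it solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Trained extensively on more than five million images collected from diverse benchmarks, GLEE shows remarkable versatility and improved generalization, handling downstream tasks efficiently without task-specific adaptation. Integrating large volumes of automatically labeled data further improves its zero-shot generalization. In addition, GLEE can be integrated into large language models, where it serves as a foundation model that supplies universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step toward efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.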

English

In this work, we present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling GLEE to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundation model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.
