General Object Foundation Model for Images and Videos at Scale
December 14, 2023
Authors: Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
cs.AI
Abstract
In this work, we present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios across a range of object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying levels of supervision to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling GLEE to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Trained extensively on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization, efficiently tackling downstream tasks without task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundation model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.
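
The abstract describes fusing an image encoder, a text encoder, and a visual prompter into a single object-level framework. Below is a minimal, hypothetical sketch of how such multi-modal inputs could feed one set of object queries; all module names, dimensions, and heads are illustrative assumptions, not GLEE's actual implementation (see the paper and released code for that).

```python
# Hypothetical sketch (NOT the authors' architecture): image features, text
# embeddings, and visual prompts are combined into object queries that predict
# boxes and open-vocabulary scores via similarity with the text embeddings.
import torch
import torch.nn as nn


class ObjectFoundationSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100, vocab_size=30522):
        super().__init__()
        # Image encoder: a tiny patchify conv standing in for a real backbone.
        self.image_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Text encoder: token embeddings + one transformer encoder layer.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=1,
        )
        # Visual prompter: projects box prompts (x, y, w, h) into query space.
        self.visual_prompter = nn.Linear(4, d_model)
        # Learned object queries decoded against the image features.
        self.object_queries = nn.Embedding(num_queries, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.box_head = nn.Linear(d_model, 4)  # per-query box regression

    def forward(self, images, text_tokens, box_prompts=None):
        b = images.shape[0]
        # Flatten image features into a sequence of patch tokens: (B, N, D).
        feats = self.image_encoder(images).flatten(2).transpose(1, 2)
        # Encode category names or referring expressions: (B, T, D).
        text = self.text_encoder(self.text_embed(text_tokens))
        # Start from learned queries; optionally prepend prompt-derived ones.
        queries = self.object_queries.weight.unsqueeze(0).expand(b, -1, -1)
        if box_prompts is not None:
            queries = torch.cat([self.visual_prompter(box_prompts), queries], dim=1)
        # Decode object embeddings against image features: (B, Q, D).
        obj = self.decoder(queries, feats)
        boxes = self.box_head(obj).sigmoid()
        # Open-vocabulary scores: similarity between objects and text tokens.
        logits = obj @ text.transpose(1, 2)  # (B, Q, T)
        return boxes, logits


# Usage: one image, two category-name tokens, one box prompt.
model = ObjectFoundationSketch()
boxes, logits = model(
    torch.randn(1, 3, 224, 224),
    torch.randint(0, 30522, (1, 2)),
    torch.rand(1, 1, 4),
)
```

The design choice illustrated here (text-similarity classification instead of a fixed class head) is one common way open-vocabulary detectors handle arbitrary categories; it is shown only to make the encoder/prompter roles in the abstract concrete.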