General Object Foundation Model for Images and Videos at Scale

Abstract

We present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE performs detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios, covering a range of object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying levels of supervision to form general object representations, and it excels in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, which allows the model to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Trained extensively on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundation model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.
