PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Abstract

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipes. The research community has responded by using distillation from black-box models to label training data. This achieves strong benchmark results, but at the cost of measurable scientific progress: without knowing the details of the teacher model and its data sources, progress remains difficult to measure. In this paper, we study how to build a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks that focuses on the ability to reason about the "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models.
