Describe Anything: Detailed Localized Image and Video Captioning

Abstract

Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We also introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets a new state of the art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
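The abstract does not say how the focal prompt is constructed. As a rough illustration only, one way to pair global context with a higher-resolution view of a target region is to keep the full image and mask, and add a crop around the mask's bounding box enlarged by some factor. The sketch below assumes exactly that; the function names, the dictionary layout, and the `scale` parameter are hypothetical and are not taken from DAM.

```python
import math


def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero cells in a 2D mask.

    bottom and right are exclusive, so they can be used directly in slices.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask contains no foreground cells")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def focal_crop_box(mask, scale=3.0):
    """Enlarge the mask's bounding box about its center, clipped to the image.

    scale is a hypothetical context factor, not a value from the paper.
    """
    top, left, bottom, right = mask_bounding_box(mask)
    height, width = len(mask), len(mask[0])
    center_y, center_x = (top + bottom) / 2, (left + right) / 2
    half_h = (bottom - top) * scale / 2
    half_w = (right - left) * scale / 2
    return (
        max(0, math.floor(center_y - half_h)),
        max(0, math.floor(center_x - half_w)),
        min(height, math.ceil(center_y + half_h)),
        min(width, math.ceil(center_x + half_w)),
    )


def crop(grid, box):
    """Slice a 2D list with a (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in grid[top:bottom]]


def build_focal_prompt(image, mask, scale=3.0):
    """Bundle the full view (global context) with a focal crop (local detail)."""
    box = focal_crop_box(mask, scale)
    return {
        "full_image": image,
        "full_mask": mask,
        "focal_image": crop(image, box),
        "focal_mask": crop(mask, box),
        "focal_box": box,
    }
```

In a real pipeline the focal crop would then be resized to the encoder's input resolution, which is what would give the target region more pixels than it has in the full view. That step is omitted here.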
