URECA: ユニーク領域キャプション生成

要旨

リージョンレベルのキャプショニングは、特定の画像領域に対して自然言語による説明を生成し、その際にそれらの識別特徴を強調することを目的としています。しかし、既存の手法では、マルチグラニュラリティにわたって一意なキャプションを生成することが難しく、実世界での適用性が制限されています。詳細なリージョンレベルの理解の必要性に対応するため、我々はマルチグラニュラリティリージョンキャプショニングに特化した大規模データセットであるURECAデータセットを導入します。従来のデータセットが主に顕著なオブジェクトに焦点を当てていたのに対し、URECAデータセットは、多様なオブジェクト、パーツ、背景要素を取り入れることで、リージョンとキャプションの間の一意で一貫したマッピングを保証します。その中心となるのは、段階的なデータキュレーションパイプラインであり、各段階でリージョンの選択とキャプション生成を徐々に洗練させます。各段階でマルチモーダル大規模言語モデル（MLLMs）を活用することで、我々のパイプラインは、精度と意味的多様性が向上した、特徴的で文脈に基づいたキャプションを生成します。このデータセットを基に、我々はマルチグラニュラリティリージョンを効果的にエンコードするための新しいキャプショニングモデルであるURECAを提案します。URECAは、既存のMLLMsにシンプルでありながら効果的な修正を加えることで、位置や形状といった重要な空間的特性を維持し、細粒度で意味的に豊かなリージョン記述を可能にします。我々のアプローチでは、キャプションの一意性を高めるために、動的マスクモデリングと高解像度マスクエンコーダを導入しています。実験結果は、URECAがURECAデータセットで最先端の性能を達成し、既存のリージョンレベルキャプショニングベンチマークにもうまく一般化することを示しています。

English

Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multi-granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on URECA dataset and generalizes well to existing region-level captioning benchmarks.

URECA: ユニーク領域キャプション生成

URECA: Unique Region Caption Anything

要旨

Support