URECA: Unique Region Caption Anything

Abstract

Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multiple levels of granularity, which limits their real-world applicability. To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to the dataset is a stage-wise data curation pipeline, in which each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA preserves essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.
