Describe Anything: Detailed Localized Image and Video Captioning

Abstract

Generating detailed and accurate descriptions of specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of the targeted region, and a localized vision backbone, which integrates precise localization with the region's broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We also introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets a new state of the art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
