OmniCaptioner: One Captioner to Rule Them All

Abstract

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.