OmniCaptioner: One Captioner to Rule Them All

Abstract

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

