UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

Abstract

Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of omnimodal datasets and of lightweight, capable models hampers progress in fine-grained multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of the audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three-stage human-in-the-loop pipeline covering audio-only, visual-only, and joint audio-visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross-modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner(3B), a 3B-parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy, in which supervised fine-tuning is followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.

