

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

July 15, 2025
Authors: Peiran Wu, Yunze Liu, Zhengdong Zhu, Enmin Zhou, Shawn Shen
cs.AI

Abstract

Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of omnimodal datasets and of lightweight, capable models hampers progress in fine-grained multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three-stage human-in-the-loop pipeline covering audio-only, visual-only, and joint audio-visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross-modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner (3B), a 3B-parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy, supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.
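The two-stage recipe named in the abstract, supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), can be made concrete with a minimal sketch. The snippet below shows only the generic GRPO mechanics (group-normalized advantages plus a PPO-style clipped surrogate loss), not the paper's actual implementation; the function names and the scalar caption-reward interface are illustrative assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per sampled caption for a video.

    GRPO normalizes rewards within the sampled group, replacing PPO's
    learned value-function baseline with a group-relative one."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate over per-caption log-probabilities."""
    ratio = torch.exp(logp_new - logp_old)                 # importance ratios
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy usage: rewards for G = 4 sampled captions of one video.
rewards = torch.tensor([0.2, 0.9, 0.5, 0.4])
adv = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples for the same video, no critic network is needed, which is one reason this style of optimization suits data-efficient adaptation from limited distillation data.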