

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

July 15, 2025
Authors: Peiran Wu, Yunze Liu, Zhengdong Zhu, Enmin Zhou, Shawn Shen
cs.AI

Abstract

Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. The lack of omnimodal datasets and of lightweight, capable models hampers progress in fine-grained multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three-stage human-in-the-loop pipeline covering audio-only, visual-only, and joint audio-visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross-modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner (3B), a 3B-parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy, supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.
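The abstract names GRPO as the second training stage but gives no implementation details. As a rough, hypothetical sketch of the group-relative idea behind GRPO (the function names, group size, clipping constant, and the omitted reward function and KL penalty are all assumptions, not the paper's actual recipe): for each video, a group of candidate captions is sampled, rewards are normalized within the group to form critic-free advantages, and a PPO-style clipped surrogate is minimized.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per caption sampled
    # for the same video. Normalizing within the group yields a
    # "group-relative" advantage, so no learned critic is needed.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    # logp_new / logp_old: shape (G,), summed token log-probs of each
    # sampled caption under the current policy and the sampling policy.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # PPO-style clipped surrogate, averaged over the group; a KL term
    # against a reference policy would typically be added here.
    return -torch.min(unclipped, clipped).mean()
```

In this sketch, the reward for each caption would come from some caption-quality signal (unspecified in the abstract); the group normalization is what lets a small 3B model be tuned from limited data without a value network.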