
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

February 9, 2026
Authors: Linli Yao, Yuancheng Wei, Yaojie Zhang, Lei Li, Xinlong Chen, Feifan Song, Ziyue Wang, Kun Ouyang, Yuanxin Liu, Lingpeng Kong, Qi Liu, Pengfei Wan, Kun Gai, Yuanxing Zhang, Xu Sun
cs.AI

Abstract

This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at https://github.com/yaolinli/TimeChat-Captioner.
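The six-dimensional structural schema is specified in the full paper; the abstract only says that it yields "script-like", time-stamped audio-visual captions. As a rough illustration of what such an entry could look like, here is a minimal Python sketch. The dimension names used below (setting, subjects, actions, camera, speech, audio_events) are hypothetical placeholders for this sketch, not the paper's actual schema.

```python
# A minimal sketch of a "script-like" dense caption entry with explicit
# timestamps. The six dimension fields are illustrative placeholders only;
# the actual schema is defined in the TimeChat-Captioner paper.
from dataclasses import dataclass


@dataclass
class SceneCaption:
    start: float          # scene start time, in seconds
    end: float            # scene end time, in seconds
    # Hypothetical structural dimensions (placeholders):
    setting: str          # where the scene takes place
    subjects: str         # who or what appears on screen
    actions: str          # what happens visually
    camera: str           # shot type / camera movement
    speech: str           # transcribed or summarized dialogue
    audio_events: str     # non-speech sounds, music, ambience


def to_script(captions: list[SceneCaption]) -> str:
    """Render time-stamped scene captions as a screenplay-style script."""
    lines = []
    for c in captions:
        lines.append(f"[{c.start:.1f}s - {c.end:.1f}s]")
        lines.append(f"  Setting: {c.setting} | Subjects: {c.subjects}")
        lines.append(f"  Actions: {c.actions} | Camera: {c.camera}")
        lines.append(f"  Speech: {c.speech} | Audio: {c.audio_events}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        SceneCaption(0.0, 7.5, "a sunlit kitchen", "a woman and a dog",
                     "she chops vegetables; the dog watches",
                     "medium shot, slow pan",
                     "\"Almost ready!\"", "knife tapping, soft radio music"),
    ]
    print(to_script(demo))
```

Explicit start/end timestamps on every scene are what make such captions usable for the downstream temporal-grounding evaluation (e.g., Charades-STA) mentioned in the abstract, since predicted segments can be matched against reference segments by temporal overlap.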