TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
February 9, 2026
Authors: Linli Yao, Yuancheng Wei, Yaojie Zhang, Lei Li, Xinlong Chen, Feifan Song, Ziyue Wang, Kun Ouyang, Yuanxin Liu, Lingpeng Kong, Qi Liu, Pengfei Wan, Kun Gai, Yuanxing Zhang, Xu Sun
cs.AI
Abstract
This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at https://github.com/yaolinli/TimeChat-Captioner.
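To make the task output concrete, the following is a minimal, hypothetical sketch of what a time-stamped, script-like caption entry might look like. The six dimension names used below (setting, characters, actions, camera, speech, sounds) are illustrative assumptions only; the abstract states that the schema has six structural dimensions but does not enumerate them, so this is not the paper's actual schema or code.

```python
# Hypothetical sketch of a time-stamped, structured scene caption for the
# Omni Dense Captioning task. Field names are illustrative assumptions,
# NOT the schema defined in the paper.
from dataclasses import dataclass, asdict
import json


@dataclass
class SceneCaption:
    start: float           # scene start time in seconds
    end: float              # scene end time in seconds
    # Six illustrative (assumed) structural dimensions:
    setting: str = ""       # where the scene takes place
    characters: str = ""    # who appears in the scene
    actions: str = ""       # what happens visually
    camera: str = ""        # shot type / camera movement
    speech: str = ""        # transcribed or summarized dialogue
    sounds: str = ""        # non-speech audio (music, ambient sound)


def to_script(scenes: list[SceneCaption]) -> str:
    """Render a list of scene captions as a screenplay-style script."""
    lines = []
    for i, scene in enumerate(scenes, 1):
        lines.append(f"[Scene {i}] {scene.start:.1f}s - {scene.end:.1f}s")
        for key, value in asdict(scene).items():
            if key in ("start", "end") or not value:
                continue
            lines.append(f"  {key.capitalize()}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        SceneCaption(
            0.0, 8.5,
            setting="Kitchen, morning light",
            characters="A woman in an apron",
            actions="Cracks eggs into a bowl and whisks them",
            camera="Medium shot, slow pan",
            speech='"Let\'s start with the eggs."',
            sounds="Whisking, soft background music",
        ),
    ]
    print(to_script(demo))                              # screenplay-style view
    print(json.dumps([asdict(s) for s in demo], indent=2))  # structured view
```

Such a structured representation also makes it straightforward to evaluate time-aware coverage per scene, which is the setting the proposed SodaM metric targets; the metric itself is not reproduced here.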