ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
November 18, 2025
Authors: Junfu Pu, Teng Wang, Yixiao Ge, Yuying Ge, Chen Li, Ying Shan
cs.AI
Abstract
The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training data with annotations that are typically short and coarse, restricting generalization to the nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on million-scale long-video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene text, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, in both data volume and annotation density. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Furthermore, ARC-Chapter shows excellent transferability, improving the state-of-the-art on downstream tasks such as dense video captioning on YouCook2.
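The abstract describes GRACE only at a high level: it combines many-to-one segment overlap with semantic similarity of chapter descriptions. As a rough illustration of that idea (not the paper's actual formulation), the sketch below matches each predicted chapter to its best-overlapping reference chapter and weights temporal IoU by a stand-in title similarity; the function names, the Jaccard stand-in for semantic similarity, and the averaging scheme are all hypothetical.

```python
# Hypothetical GRACE-style chaptering score. Assumptions: the paper's exact
# formula is not given in the abstract; this sketch only illustrates the
# stated ingredients (many-to-one segment overlap + semantic similarity).

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def title_similarity(t1, t2):
    """Stand-in semantic similarity: token-level Jaccard overlap.
    A real metric would use sentence embeddings instead."""
    w1, w2 = set(t1.lower().split()), set(t2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

def grace_like_score(predicted, reference):
    """Match each predicted chapter many-to-one to its best reference
    chapter, then average IoU * title-similarity over predictions."""
    if not predicted:
        return 0.0
    total = 0.0
    for p_seg, p_title in predicted:
        total += max(
            temporal_iou(p_seg, r_seg) * title_similarity(p_title, r_title)
            for r_seg, r_title in reference
        )
    return total / len(predicted)

# Toy example: two predicted chapters scored against two reference chapters.
pred = [((0, 65), "intro and course overview"),
        ((65, 300), "gradient descent basics")]
ref = [((0, 60), "course intro"),
       ((60, 310), "introduction to gradient descent")]
print(round(grace_like_score(pred, ref), 3))
```

The many-to-one matching is what gives the metric its flexibility: a model that splits one reference chapter into two reasonable sub-chapters is still credited for the overlap, rather than penalized by a strict one-to-one alignment.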