ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
July 28, 2025
作者: Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan
cs.AI
Abstract
Real-world user-generated short videos, especially those distributed on
platforms such as WeChat Channel and TikTok, dominate the mobile internet.
However, current large multimodal models lack essential temporally-structured,
detailed, and in-depth video comprehension capabilities, which are the
cornerstone of effective video search and recommendation, as well as emerging
video applications. Understanding real-world shorts is challenging due to
their complex visual elements, high information density in both visuals and
audio, and fast pacing centered on emotional expression and viewpoint
delivery; it therefore demands advanced reasoning that effectively integrates
multimodal information, including visual, audio, and text. In this work, we introduce
ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual
signals from raw video inputs end-to-end for structured comprehension. The
model is capable of multi-granularity timestamped video captioning and
summarization, open-ended video question answering, temporal video grounding,
and video reasoning. Leveraging high-quality data from an automated annotation
pipeline, our compact 7B-parameter model is trained through a comprehensive
regimen: pre-training, instruction fine-tuning, cold start, reinforcement
learning (RL) post-training, and final instruction fine-tuning. Quantitative
evaluations on ShortVid-Bench, the benchmark we introduce, together with
qualitative comparisons demonstrate its strong performance in real-world video
comprehension, and it supports zero-shot use or fine-tuning with only a few
samples for diverse downstream applications. The real-world production
deployment of our model has yielded tangible, measurable improvements in user
engagement and satisfaction, a success underpinned by its remarkable
efficiency: stress tests indicate an inference time of just 10 seconds for a
one-minute video on an H20 GPU.
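
The abstract does not specify an output schema, so purely as an illustration of what multi-granularity timestamped captioning, summarization, and temporal grounding could look like as structured data, here is a minimal, self-contained Python sketch. All names (`TimedCaption`, `StructuredComprehension`) and the keyword-matching grounding are assumptions for illustration, not the model's actual interface:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema illustrating the kind of structured output the
# abstract describes: fine-grained timestamped captions plus a
# coarse-grained summary. Not ARC-Hunyuan-Video-7B's real API.

@dataclass
class TimedCaption:
    start_s: float  # segment start time, in seconds
    end_s: float    # segment end time, in seconds
    caption: str    # caption for this clip-level segment

@dataclass
class StructuredComprehension:
    captions: List[TimedCaption] = field(default_factory=list)  # fine-grained
    summary: str = ""                                           # coarse-grained

    def ground(self, query: str) -> List[TimedCaption]:
        """Toy temporal grounding: return segments whose caption mentions
        the query. The real model grounds queries via learned multimodal
        reasoning, not string matching."""
        q = query.lower()
        return [c for c in self.captions if q in c.caption.lower()]

# Example: a one-minute short broken into three timestamped segments.
result = StructuredComprehension(
    captions=[
        TimedCaption(0.0, 18.5, "Creator introduces a street-food stall"),
        TimedCaption(18.5, 41.0, "Close-up of noodles being stir-fried"),
        TimedCaption(41.0, 60.0, "Taste test and final verdict"),
    ],
    summary="A street-food review short: stall intro, cooking, and verdict.",
)
print(result.ground("noodles")[0].start_s)  # -> 18.5
```

In the system the abstract describes, such structure is produced end-to-end from raw video and audio; the keyword match above merely stands in for the model's learned temporal grounding.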