ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
July 28, 2025
作者: Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan
cs.AI
Abstract
Real-world user-generated short videos, especially those distributed on
platforms such as WeChat Channel and TikTok, dominate the mobile internet.
However, current large multimodal models lack essential temporally-structured,
detailed, and in-depth video comprehension capabilities, which are the
cornerstone of effective video search and recommendation, as well as emerging
video applications. Understanding real-world shorts is challenging due to
their complex visual elements, high information density in both visuals and
audio, and fast pacing centered on emotional expression and viewpoint
delivery; it therefore demands advanced reasoning that effectively integrates
multimodal information, including visual, audio, and text. In this work, we introduce
ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual
signals from raw video inputs end-to-end for structured comprehension. The
model is capable of multi-granularity timestamped video captioning and
summarization, open-ended video question answering, temporal video grounding,
and video reasoning. Leveraging high-quality data from an automated annotation
pipeline, our compact 7B-parameter model is trained through a comprehensive
regimen: pre-training, instruction fine-tuning, cold start, reinforcement
learning (RL) post-training, and final instruction fine-tuning. Quantitative
evaluations on ShortVid-Bench, the benchmark we introduce, together with
qualitative comparisons demonstrate its strong performance in real-world video
comprehension, and it supports zero-shot use or fine-tuning with only a few
samples for diverse downstream applications. The real-world production
deployment of our model has yielded tangible, measurable improvements in user
engagement and satisfaction, a success underpinned by its remarkable
efficiency: stress tests indicate an inference time of just 10 seconds for a
one-minute video on an H20 GPU.
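
The abstract does not specify an output schema, so purely as an illustration of what multi-granularity timestamped captioning, summarization, and temporal grounding could look like as structured data, here is a minimal, self-contained Python sketch. All names (`TimedCaption`, `StructuredComprehension`) and the keyword-matching grounding are assumptions for illustration, not the model's actual interface:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema illustrating the kind of structured output the
# abstract describes: fine-grained timestamped captions plus a
# coarse-grained summary. Not ARC-Hunyuan-Video-7B's real API.

@dataclass
class TimedCaption:
    start_s: float  # segment start time, in seconds
    end_s: float    # segment end time, in seconds
    caption: str    # caption for this clip-level segment

@dataclass
class StructuredComprehension:
    captions: List[TimedCaption] = field(default_factory=list)  # fine-grained
    summary: str = ""                                           # coarse-grained

    def ground(self, query: str) -> List[TimedCaption]:
        """Toy temporal grounding: return segments whose caption mentions
        the query. The real model grounds queries via learned multimodal
        reasoning, not string matching."""
        q = query.lower()
        return [c for c in self.captions if q in c.caption.lower()]

# Example: a one-minute short broken into three timestamped segments.
result = StructuredComprehension(
    captions=[
        TimedCaption(0.0, 18.5, "Creator introduces a street-food stall"),
        TimedCaption(18.5, 41.0, "Close-up of noodles being stir-fried"),
        TimedCaption(41.0, 60.0, "Taste test and final verdict"),
    ],
    summary="A street-food review short: stall intro, cooking, and verdict.",
)
print(result.ground("noodles")[0].start_s)  # -> 18.5
```

In the system the abstract describes, such structure is produced end-to-end from raw video and audio; the keyword match above merely stands in for the model's learned temporal grounding.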