ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

Abstract

Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack the temporally structured, detailed, and in-depth video comprehension capabilities that underpin effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is challenging because of their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. It therefore requires advanced reasoning that integrates multimodal information, including visual, audio, and textual signals. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our newly introduced benchmark, ShortVid-Bench, together with qualitative comparisons, demonstrate its strong performance in real-world video comprehension. The model also supports zero-shot use, or fine-tuning with a few samples, for diverse downstream applications. Its real-world production deployment has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its efficiency: stress tests indicate an inference time of just 10 seconds for a one-minute video on an H20 GPU.
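To put the reported efficiency figure in perspective, the short Python sketch below works through the arithmetic implied by "10 seconds for a one-minute video". The function names are hypothetical, and the per-hour figure assumes strictly sequential processing of one-minute videos on a single GPU with constant latency. That is an illustrative assumption, not a deployment detail stated in the abstract.

```python
def real_time_factor(video_seconds: float, inference_seconds: float) -> float:
    """Seconds of video processed per second of inference (hypothetical helper)."""
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    return video_seconds / inference_seconds


def videos_per_gpu_hour(inference_seconds: float) -> float:
    """Videos handled per GPU-hour, assuming sequential processing and
    constant per-video latency (an assumption, not a reported number)."""
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    return 3600.0 / inference_seconds


# Figures taken from the abstract: a 60-second video, 10 seconds of inference.
rtf = real_time_factor(video_seconds=60.0, inference_seconds=10.0)
per_hour = videos_per_gpu_hour(inference_seconds=10.0)
print(f"real-time factor: {rtf:.1f}x, videos per GPU-hour: {per_hour:.0f}")
```

Under these assumptions the model runs about six times faster than playback speed. Batching, longer videos, or different hardware would change the per-hour estimate.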
