MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Abstract

Sora's high motion intensity and long, consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. We use GPT-4V to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. To demonstrate the utility and effectiveness of MiraData, we conduct experiments using our DiT-based video generation model, MiraDiT. The experimental results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.
