FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

Abstract

DiT diffusion models have achieved great success in text-to-video generation by leveraging their scalability in model capacity and data scale. However, achieving high content and motion fidelity aligned with text prompts often requires a large number of model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, which further amplifies computational demands, especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a generation process that uses a large number of parameters and sufficient NFEs, and that runs at low resolution to keep computation efficient. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design lets users preview the initial output before committing to full-resolution generation, which significantly reduces computational cost and wait time and enhances commercial viability.
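The second stage's idea of flow matching between low and high resolutions can be illustrated with a toy sketch. The abstract does not specify the path or the sampler, so everything here is an assumption for illustration: a straight-line path from an upsampled low-resolution array to its high-resolution counterpart, a fixed-step Euler sampler in which each step costs one function evaluation, and hypothetical helper names (`upsample_nearest`, `interpolate`, `target_velocity`, `euler_sample`). It operates on small 2D arrays rather than video latents, and an oracle velocity stands in for a trained network.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (H, W) array by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for a velocity model on the straight path (assumed)."""
    return x1 - x0

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1.

    velocity_fn is called once per step, so num_steps equals the NFE count.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
low_res = rng.standard_normal((4, 4))    # toy stand-in for a first-stage output
high_res = rng.standard_normal((8, 8))   # toy stand-in for the high-resolution target
x0 = upsample_nearest(low_res, 2)        # start the flow from the upsampled input

# Oracle velocity: a placeholder for a trained network, used only for this demo.
oracle = lambda x, t: target_velocity(x0, high_res)
result = euler_sample(x0, oracle, num_steps=4)
```

Because the assumed path is straight and the oracle velocity is constant, a handful of Euler steps reaches the target in this toy setting. A learned velocity field is only approximately straight, so the step count needed in practice is an empirical question that this sketch does not answer.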
