RhymeFlow: Training-Free Acceleration of Video Generation via Asynchronous Denoising Flow Scheduling

Abstract

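Video generation models based on Diffusion Transformers (DiTs) achieve remarkable performance in video synthesis, but they incur high inference latency and computational cost because of the quadratic complexity of 3D attention. Existing acceleration methods mainly reduce the computational complexity within each denoising step, through techniques such as sparse attention and KV caching. However, they rigidly follow an inherent constraint of the standard diffusion pipeline: every frame of the target video sequence must go through a complete, dense denoising process over all diffusion timesteps. We observe that, because content and motion are correlated across adjacent frames, once the keyframes carrying critical semantic transitions are fixed, the intermediate states of the other frames often follow more predictable trajectories. This indicates that such a uniform, dense denoising process is inherently redundant for natural video data. We therefore introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that govern the latent semantic evolution. Then only these keyframes undergo dense, step-by-step denoising to ensure structural consistency, while non-keyframes progressively skip denoising steps to minimize computational cost. However, the skipped intermediate states of non-keyframes break temporal coherence during the keyframe denoising steps and cause visual degradation, so we further introduce a latent trajectory projection module that lets keyframes interact with a complete, temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models show that our method outperforms existing baselines, with higher inference speed and better visual quality.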

English

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational cost due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce the computational complexity within each individual denoising step, through techniques such as sparse attention and KV caching. However, they rigidly adhere to an inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must undergo a complete, dense denoising process across all diffusion timesteps. We observe that, because content and motion are correlated across adjacent frames, once keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories. This indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Because the skipped intermediate states of non-keyframes break temporal coherence during keyframe denoising steps and lead to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines, achieving higher inference speed and better visual quality.
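
The asynchronous scheduling idea described above can be illustrated with a toy Python sketch. It is not the RhymeFlow implementation: the abstract does not specify how keyframes are selected, how skipping is scheduled, or how the latent trajectory projection module works. Everything here is a hypothetical assumption made for illustration: the names `toy_velocity` and `async_denoise`, the uniform `stride`, the scalar per-frame latents, and the reuse of a cached velocity to extrapolate skipped frames (a crude stand-in for projecting skipped states so that a full sequence exists at every step).

```python
from typing import Callable, List, Sequence, Tuple


def toy_velocity(x: float, t: float) -> float:
    """Hypothetical stand-in for a DiT velocity prediction (pulls x toward 0)."""
    return -x


def async_denoise(
    latents: Sequence[float],
    keyframes: Sequence[int],
    num_steps: int,
    stride: int,
    velocity: Callable[[float, float], float] = toy_velocity,
) -> Tuple[List[float], int]:
    """Euler-integrate one scalar latent per frame with per-frame step skipping.

    Keyframes call `velocity` at every step. Other frames call it only at
    every `stride`-th step (an assumed schedule) and otherwise reuse their
    cached velocity, so every frame still has a state at every step.
    Returns the final latents and the number of velocity evaluations.
    """
    key = set(keyframes)
    x = list(latents)
    cached_v = [0.0] * len(x)
    dt = 1.0 / num_steps
    evaluations = 0
    for step in range(num_steps):
        t = step * dt
        refresh_all = step % stride == 0
        for f in range(len(x)):
            if f in key or refresh_all:
                cached_v[f] = velocity(x[f], t)  # dense update
                evaluations += 1
            # Skipped frames are extrapolated along their last known direction.
            x[f] += dt * cached_v[f]
    return x, evaluations


if __name__ == "__main__":
    final, used = async_denoise([1.0] * 8, keyframes=[0, 7], num_steps=10, stride=5)
    dense = 8 * 10  # evaluations if every frame were denoised at every step
    print(f"velocity evaluations: {used} of {dense}")
```

In this toy setting, with 8 frames, 2 keyframes, 10 steps, and a stride of 5, the loop performs 32 per-frame evaluations instead of 80. A real DiT attends jointly over all frames, so actual savings depend on how the model handles the skipped tokens, which this sketch does not model.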