EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Abstract

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, which makes chunk-based generation prone to two kinds of accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory. The method consists of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings. At 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains grow to 15%/15% and 32%/27%, respectively.
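The drift argument above can be illustrated with a toy sketch. It is not the paper's implementation: the abstract does not specify the memory update rule or the form of the velocity adjustment, so everything below is an assumption made for illustration. Latents are plain vectors, the "model" is a hypothetical endpoint predictor that copies its conditioning plus a constant error `bias`, the memory is frozen at the reference latent, and the adjustment simply blends the predicted endpoint toward the memory with a hypothetical weight `lam` before converting it to a velocity.

```python
import numpy as np

def sample_chunk(x0, endpoint_fn, memory=None, lam=0.0, steps=8):
    """Euler-integrate a straight-line flow from noise x0 (t=0) to a chunk latent (t=1)."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x1_hat = endpoint_fn(x, t)  # model's guess of the clean chunk latent
        if memory is not None:
            # Assumed form of "velocity adjustment": pull the predicted
            # endpoint toward the persistent memory before taking the step.
            x1_hat = (1.0 - lam) * x1_hat + lam * memory
        v = (x1_hat - x) / (1.0 - t)  # velocity implied by a straight path to x1_hat
        x = x + dt * v
    return x

def generate(ref, bias, num_chunks, lam=None, seed=0):
    """Generate chunks autoregressively; return each chunk's distance from `ref`."""
    rng = np.random.default_rng(seed)
    use_memory = lam is not None
    memory = ref.copy() if use_memory else None  # frozen memory (a simplification)
    prev = ref.copy()
    drift = []
    for _ in range(num_chunks):
        # Toy model: each chunk reproduces its conditioning plus a small error.
        endpoint_fn = lambda x, t, c=prev: c + bias
        x0 = rng.standard_normal(ref.shape)
        prev = sample_chunk(x0, endpoint_fn, memory, lam if use_memory else 0.0)
        drift.append(float(np.linalg.norm(prev - ref)))
    return drift

ref = np.ones(4)
bias = np.full(4, 0.1)
naive = generate(ref, bias, num_chunks=9)               # conditions only on the previous chunk
anchored = generate(ref, bias, num_chunks=9, lam=0.5)   # also anchored to the memory
print(naive[-1], anchored[-1])
```

In this toy setting the unanchored error grows linearly with the number of chunks, while the anchored error stays bounded because each chunk is pulled back toward the same memory. How closely this mirrors the actual method depends on details the abstract does not give.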