EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Abstract

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.
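The interplay of the two mechanisms can be illustrated with a toy sketch. It is not the authors' implementation, only a minimal NumPy example under stated assumptions: `toy_velocity` is a hypothetical stand-in for the video backbone, `bias` is an artificial per-chunk error that causes drift, `anchor` and `tail` are a hypothetical two-part context memory (a fixed reference latent and the previous chunk's latent), and `lam` is an assumed restoration strength. The velocity adjustment used here, re-aiming each Euler step at an endpoint blended with the anchor, is one simple choice and may differ from the paper's actual objective.

```python
import numpy as np


def toy_velocity(x, t, cond, bias):
    """Hypothetical stand-in for a video backbone: flows toward cond + bias.

    `bias` mimics the small per-chunk error that accumulates into drift.
    """
    return (cond + bias - x) / (1.0 - t)


def sample_chunk(cond, anchor, bias, lam, steps, rng):
    """Euler-integrate one chunk from noise (t=0) to latents (t=1)."""
    x = rng.standard_normal(cond.shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, cond, bias)
        x1_hat = x + (1.0 - t) * v                    # endpoint implied by the velocity
        x1_fix = (1.0 - lam) * x1_hat + lam * anchor  # pull the endpoint toward the memory
        v = (x1_fix - x) / (1.0 - t)                  # adjusted velocity (assumed form)
        x = x + dt * v
    return x


def generate(num_chunks, lam, steps=10, dim=4, seed=0):
    """Generate chunks sequentially; return each chunk's distance from the anchor."""
    rng = np.random.default_rng(seed)
    anchor = np.ones(dim)      # memory part 1: fixed reference (identity) latent
    tail = anchor.copy()       # memory part 2: latent carried over from the last chunk
    bias = np.full(dim, 0.05)  # artificial per-chunk error
    drift = []
    for _ in range(num_chunks):
        chunk = sample_chunk(tail, anchor, bias, lam, steps, rng)
        tail = chunk           # propagate context to the next chunk
        drift.append(float(np.linalg.norm(chunk - anchor)))
    return drift


baseline = generate(num_chunks=9, lam=0.0)  # no restoration
restored = generate(num_chunks=9, lam=0.3)  # with restoration
print(f"final drift without restoration: {baseline[-1]:.3f}")
print(f"final drift with restoration:    {restored[-1]:.3f}")
```

In this toy setting, the drift with `lam=0.0` grows by a fixed amount with every chunk, while any `lam > 0` keeps it bounded, because each chunk's endpoint is partly pulled back to the anchor. The numbers say nothing about the reported PSNR/SSIM/LPIPS/FID gains; they only illustrate why anchoring to a persistent memory can limit accumulated drift.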