EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Abstract

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, which makes chunk-based generation prone to two kinds of accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, which is built from two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks, propagating identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% for PSNR/SSIM and 32%/27% for LPIPS/FID.
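
The abstract names the two mechanisms but gives no equations, so the following is only a toy sketch of the general idea, not the authors' method. It assumes that the "velocity adjustment" can be modeled as an additive pull toward a memory anchor during Euler sampling, and that the context memory can be modeled as an exponential moving average of chunk outputs. Every name here (`toy_velocity`, `sample_chunk`, `generate`, `drift`, `lam`, `decay`, `bias`) is hypothetical, and the learned velocity field is replaced by a hand-written function with a constant bias that stands in for per-chunk drift.

```python
from typing import List

Vec = List[float]


def toy_velocity(x: Vec, t: float, target: Vec, bias: float) -> Vec:
    """Hypothetical stand-in for a learned velocity field.

    Pulls x toward `target` and adds a constant `bias` that models
    per-chunk drift. `t` is unused in this toy field.
    """
    return [(tg - xi) + bias for xi, tg in zip(x, target)]


def sample_chunk(x0: Vec, target: Vec, anchor: Vec,
                 bias: float, lam: float, steps: int = 20) -> Vec:
    """Euler integration of dx/dt = v(x, t) + lam * (anchor - x).

    The `lam * (anchor - x)` term is an ASSUMED form of velocity
    adjustment; lam = 0 recovers plain, unanchored sampling.
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = toy_velocity(x, k * dt, target, bias)
        v = [vi + lam * (a - xi) for vi, a, xi in zip(v, anchor, x)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate(reference: Vec, num_chunks: int, bias: float,
             lam: float, decay: float = 0.9) -> List[Vec]:
    """Chunk-by-chunk generation with a persistent latent memory.

    The memory is seeded from the reference latent and updated as an
    exponential moving average of chunk outputs (an assumption).
    """
    memory = list(reference)
    prev = list(reference)
    outputs = []
    for _ in range(num_chunks):
        # Each chunk continues from the previous one, so errors accumulate.
        out = sample_chunk(prev, target=prev, anchor=memory, bias=bias, lam=lam)
        memory = [decay * m + (1.0 - decay) * o for m, o in zip(memory, out)]
        outputs.append(out)
        prev = out
    return outputs


def drift(outputs: List[Vec], reference: Vec) -> float:
    """Largest per-dimension deviation of the final chunk from the reference."""
    return max(abs(a - b) for a, b in zip(outputs[-1], reference))


if __name__ == "__main__":
    ref = [0.0, 1.0, -1.0]
    free = generate(ref, num_chunks=10, bias=0.1, lam=0.0)
    anchored = generate(ref, num_chunks=10, bias=0.1, lam=1.0)
    print("drift without anchoring:", drift(free, ref))
    print("drift with anchoring:   ", drift(anchored, ref))
```

In this toy setting the unanchored run drifts by the same amount on every chunk, so the error grows linearly with the number of chunks, while the anchored run ends closer to the reference. Because the moving-average memory itself follows the outputs, the anchored drift is reduced rather than eliminated here; how the actual method builds and uses its memory is not stated in the abstract.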