SCAIL-2: Unifying Controllable Character Animation via End-to-End In-Context Conditioning

Abstract

Controllable character animation requires transferring motion from a driving sequence to a reference character. Prior works rely heavily on intermediate representations, such as pose skeletons to represent motion or masked backgrounds to represent the environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses these intermediate representations and achieves end-to-end character animation. By directly concatenating the driving video to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify the sub-tasks of character animation under decoupled conditions and then design a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset covering heterogeneous character animation tasks. To achieve this unification, we use in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthesis discrepancies in detailed regions, we propose Bias-Aware DPO, which constructs preference items to mitigate these errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches across a variety of character animation tasks. A large subset of the synthetic data, as well as the model weights, will be released on our project page: https://teal024.github.io/SCAIL-2/.
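
The idea of concatenating the driving video to the sequence can be illustrated with a minimal sketch. It is not the SCAIL-2 implementation: the `Segment` type, the `build_in_context_sequence` helper, the segment names, and the toy two-dimensional token features are all hypothetical, and the sketch only shows how context tokens and tokens to be generated might share one sequence while remaining distinguishable. It does not model mask conditioning, RoPE, or DPO.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One block of tokens in the joint sequence (hypothetical structure)."""
    name: str                   # illustrative label, e.g. "reference" or "driving"
    tokens: List[List[float]]   # one feature vector per token
    is_condition: bool          # True for context tokens, False for tokens to generate


def build_in_context_sequence(
    segments: List[Segment],
) -> Tuple[List[List[float]], List[int], List[bool]]:
    """Concatenate segments along the token axis and record each token's origin."""
    tokens: List[List[float]] = []
    segment_ids: List[int] = []
    condition_mask: List[bool] = []
    for seg_id, seg in enumerate(segments):
        tokens.extend(seg.tokens)
        segment_ids.extend([seg_id] * len(seg.tokens))
        condition_mask.extend([seg.is_condition] * len(seg.tokens))
    return tokens, segment_ids, condition_mask


# Toy inputs: one reference token, two driving-video tokens, two target tokens.
reference = Segment("reference", [[0.1, 0.2]], True)
driving = Segment("driving", [[0.3, 0.4], [0.5, 0.6]], True)
target = Segment("target", [[0.0, 0.0], [0.0, 0.0]], False)

tokens, segment_ids, condition_mask = build_in_context_sequence(
    [reference, driving, target]
)
print(len(tokens), segment_ids, condition_mask)
```

In a design like this, the segment ids are the natural place to attach any per-segment positional or mask signal, which is one way to read the "soft guidance" mentioned in the abstract. That reading is an assumption rather than something the abstract specifies.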