SCAIL-2: Unifying Controlled Character Animation with End-to-End In-Context Conditioning

Abstract

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior work relies heavily on intermediate representations, such as pose skeletons to represent motion or masked backgrounds to represent the environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses these intermediates and achieves end-to-end character animation. By directly concatenating the driving video to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify the sub-tasks of character animation with decoupled conditions and design a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset covering heterogeneous character animation tasks. To achieve this unification, we use in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancies in detailed regions, we propose Bias-Aware DPO, which constructs preference items to mitigate these errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches across various character animation tasks. A large subset of the synthetic data and the model weights will be released on our project page: https://teal024.github.io/SCAIL-2/.
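
The abstract does not say how Bias-Aware DPO builds its preference items or whether it changes the objective. For orientation only, the following is a minimal sketch of the standard DPO loss for a single preference pair, the general objective that DPO-style methods start from. It is not SCAIL-2's formulation: the function name, the argument names, and the default `beta` are illustrative assumptions, and the inputs are assumed to be scalar log-likelihoods of the preferred and rejected samples under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Hypothetical helper for illustration; not the Bias-Aware variant.
    All four inputs are log-likelihoods; beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy favors the preferred
    # sample over the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    # Loss is -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # evaluated in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more likelihood to the preferred sample. A bias-aware variant would presumably differ in how the pairs are chosen or weighted, which this sketch does not attempt to model.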