SCAIL-2: Unifying Controlled Character Animation with End-to-End In-Context Conditioning

Abstract

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works rely heavily on intermediate representations, such as pose skeletons to represent motion or masked backgrounds to represent the environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses these intermediate representations and achieves end-to-end character animation. By concatenating driving videos directly to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then build a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous character animation tasks. To achieve this unification, we use in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches on various character animation tasks. A large subset of the synthetic data as well as the model weights will be released on our project page: https://teal024.github.io/SCAIL-2/.

English

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works rely heavily on intermediate representations, such as pose skeletons to represent motion or masked backgrounds to represent the environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses these intermediate representations and achieves end-to-end character animation. By concatenating driving videos directly to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then build a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous character animation tasks. To achieve this unification, we use in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches on various character animation tasks. A large subset of the synthetic data as well as the model weights will be released on our project page: https://teal024.github.io/SCAIL-2/.
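
To illustrate what concatenating driving videos directly to the sequence could look like, the following is a minimal sketch, not the SCAIL-2 implementation. It assumes that the reference image, the driving video, and the noisy target video have already been turned into token lists, and it tags each token with a segment id that a model could use to tell the groups apart. The function name, the segment ids, and the token counts are all hypothetical and do not come from the abstract.

```python
from typing import List, Tuple

Token = List[float]  # one token embedding (hypothetical representation)


def build_in_context_sequence(
    reference: List[Token],
    driving: List[Token],
    noisy_target: List[Token],
) -> Tuple[List[Token], List[int]]:
    """Concatenate three token groups along the sequence axis.

    Returns the joint sequence and one segment id per token:
    0 = reference, 1 = driving, 2 = noisy target (assumed convention).
    """
    groups = [reference, driving, noisy_target]
    # Flatten the groups in order so the model sees them as one sequence.
    sequence = [token for group in groups for token in group]
    # Record which group each token came from.
    segment_ids = [gid for gid, group in enumerate(groups) for _ in group]
    return sequence, segment_ids


# Toy example: 4 reference tokens, 16 driving tokens, 16 target tokens.
dim = 8
reference = [[0.0] * dim for _ in range(4)]
driving = [[1.0] * dim for _ in range(16)]
noisy_target = [[2.0] * dim for _ in range(16)]

sequence, segment_ids = build_in_context_sequence(reference, driving, noisy_target)
print(len(sequence), segment_ids[:5])
```

In this sketch the driving video enters the model as raw tokens rather than as a pose skeleton or masked background, which is the point the abstract makes about avoiding intermediate representations. How the segments are actually distinguished in SCAIL-2 (for example through mode-specific RoPE or mask conditioning) is not specified here.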