DragAnything: Motion Control for Anything using Entity Representation

Abstract

We introduce DragAnything, which uses an entity representation to achieve motion control for any object in controllable video generation. Compared with existing motion control methods, DragAnything offers several advantages. First, trajectory-based control is more user-friendly for interaction when acquiring other guidance signals (e.g., masks, depth maps) is labor-intensive: users only need to draw a line (trajectory) during interaction. Second, our entity representation serves as an open-domain embedding capable of representing any object, enabling motion control for diverse entities, including the background. Finally, our entity representation allows simultaneous and distinct motion control for multiple objects. Extensive experiments demonstrate that DragAnything achieves state-of-the-art performance on FVD, FID, and a user study, particularly in object motion control, where our method surpasses previous methods (e.g., DragNUWA) by 26% in human voting.
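
To make the trajectory-based interaction concrete, the following is a minimal sketch of how user-drawn lines could be turned into per-frame target positions, with one trajectory per entity. It is an illustration only, not the DragAnything implementation: the function name `resample_trajectory`, the entity labels, the frame count, and the choice of evenly spaced arc-length resampling are all assumptions made for this example.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def resample_trajectory(points: List[Point], num_frames: int) -> List[Point]:
    """Resample a drawn polyline into one (x, y) position per frame,
    spaced evenly along the line's arc length (hypothetical helper)."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if not points:
        raise ValueError("points must not be empty")
    if len(points) == 1 or num_frames == 1:
        return [points[0]] * num_frames

    # Cumulative arc length at each input vertex.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]
    if total == 0.0:
        # All vertices coincide, so the entity stays in place.
        return [points[0]] * num_frames

    resampled: List[Point] = []
    segment = 0
    for frame in range(num_frames):
        target = total * frame / (num_frames - 1)
        # Advance to the segment that contains the target arc length.
        while segment < len(points) - 2 and cumulative[segment + 1] < target:
            segment += 1
        seg_len = cumulative[segment + 1] - cumulative[segment]
        t = 0.0 if seg_len == 0.0 else (target - cumulative[segment]) / seg_len
        (x0, y0), (x1, y1) = points[segment], points[segment + 1]
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled


# One drawn line per entity; the labels and coordinates are made up.
drawn: Dict[str, List[Point]] = {
    "object_a": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)],
    "background": [(5.0, 5.0), (5.0, 9.0)],
}
per_frame = {name: resample_trajectory(path, 5) for name, path in drawn.items()}
```

Keeping a separate trajectory per entity mirrors the abstract's point that several objects, and the background, can each be given their own motion at the same time. How those positions are then combined with the entity representation is not covered by this sketch.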
