DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
June 12, 2025
Authors: Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang, Zerong Zheng, Ming Zhou
cs.AI
Abstract
In e-commerce and digital marketing, generating high-fidelity human-product
demonstration videos is important for effective product presentation. However,
most existing frameworks either fail to preserve the identities of both humans
and products or lack an understanding of human-product spatial relationships,
leading to unrealistic representations and unnatural interactions. To address
these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our
method simultaneously preserves human identities and product-specific details,
such as logos and textures, by injecting paired human-product reference
information and utilizing an additional masked cross-attention mechanism. We
employ a 3D body mesh template and product bounding boxes to provide precise
motion guidance, enabling intuitive alignment of hand gestures with product
placements. Additionally, structured text encoding is used to incorporate
category-level semantics, enhancing 3D consistency during small rotational
changes across frames. Trained on a hybrid dataset with extensive data
augmentation strategies, our approach outperforms state-of-the-art techniques
in maintaining the identity integrity of both humans and products and
generating realistic demonstration motions. Project page:
https://submit2025-dream.github.io/DreamActor-H1/.
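The abstract's masked cross-attention mechanism restricts which reference tokens each video token may attend to, so human regions draw on the human reference and product regions on the product reference. The paper does not publish code for this module; the sketch below is a minimal, hypothetical NumPy illustration of the general idea (the function name, shapes, and the choice to reuse the key projection as values are assumptions, not the authors' implementation):

```python
import numpy as np

def masked_cross_attention(query, kv, mask, num_heads=4):
    """Illustrative masked cross-attention (not the paper's code).

    query: (Nq, D) video latent tokens
    kv:    (Nk, D) concatenated human + product reference tokens
    mask:  (Nq, Nk) binary; 1 where a query token may attend to a kv token
    """
    Nq, D = query.shape
    Nk = kv.shape[0]
    dh = D // num_heads
    # Split into heads: (H, N, dh)
    q = query.reshape(Nq, num_heads, dh).transpose(1, 0, 2)
    k = kv.reshape(Nk, num_heads, dh).transpose(1, 0, 2)
    v = k  # sketch simplification: values share the key projection
    # Scaled dot-product scores per head: (H, Nq, Nk)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    # Block disallowed query/reference pairs before the softmax
    scores = np.where(mask[None] == 1, scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v  # (H, Nq, dh)
    return out.transpose(1, 0, 2).reshape(Nq, D)
```

With a block-diagonal mask (e.g. human-region tokens masked to the human reference, product-region tokens to the product reference), edits to one reference's tokens cannot leak into the other's output regions, which is how the mechanism preserves both identities independently.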